Theorizing from Data video talk by Peter Norvig

Here is a video lecture by Peter Norvig, Google's Director of Research. The full title of the lecture is "Theorizing from Data: Avoiding the Capital Mistake".

In 1891 Sir Arthur Conan Doyle, writing as Sherlock Holmes, declared that "it is a capital mistake to theorize before one has data." These words remain true today.

In this talk Peter gives insight into what large amounts of data can do for problems in language understanding, translation, and information extraction. The talk is accompanied by examples from various Google services.

Moments from the lecture:

  • [00:35] Peter Norvig came to Google from NASA in 2001 because that's where the data was.
  • [01:30] Peter says that the way to make progress in AI (Artificial Intelligence) is to have more data. If you don't have data you won't make progress just with fancy algorithms.
  • [04:40] In 2001 a study comparing several algorithms for disambiguating words in sentences showed that the worst algorithm, trained on a larger word corpus, outperformed the best algorithm trained on a smaller one. Link to the original paper: Scaling to Very Very Large Corpora for Natural Language Disambiguation
  • [06:30] It took at least 30 years to go from a linguistic text collection of 1 million words (10^6 words, the Brown Corpus) to what we now have on the Internet (around 100 trillion words, or 10^14 words).
  • [06:55] Google harvested one trillion words (10^12) from the web, counted them up, and published the counts through the Linguistic Data Consortium. Announcement here; you can buy the 6 DVDs of word counts here (the price is $150).
  • [10:00] Example: Google Sets was the first experiment done using large amounts of data. It's a clustering algorithm which returns a group of similar words. Try "dog and cat" and then "more and cat" :)
  • [11:55] Example: Google Trends shows the popularity of search terms over time, based on searches performed by users.
  • [13:15] Example: Query refinement suggestions.
  • [13:40] Example: Question answering.
  • [15:30] Principles of machine reading - concepts, relational templates, patterns.
  • [16:32] Example of learning relations and patterns with machine reading.
  • [18:40] Learning classes and attributes (for example, computer games and their manufacturers).
  • [21:18] Statistical Machine Translation (See Google Language Tools).
  • [24:25] Example of Chinese to English machine translation.
  • [26:27] Main components of machine translation are Translation Model, Language Model and Decoding Algorithm.
  • [29:35] More data helps!
  • [29:45] Problem: How many bits to use to store probabilities?
  • [31:10] Problem: How to reduce space used for storing words from training data during translation process?
  • [35:25] Three turning points in the history of information.
  • [37:00] Q and A!
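The decomposition mentioned at 26:27 is the classic noisy-channel model: choose the English sentence e that maximizes P(e) * P(f | e), where P(e) comes from the language model and P(f | e) from the translation model, and the decoder searches for the argmax. Here is a toy sketch of that idea — the word tables, probabilities, and brute-force decoder are all invented for illustration and bear no relation to Google's actual system:

```python
import itertools
import math

# Toy translation model P(english word | foreign word) -- invented numbers.
translation_model = {
    "casa":   {"house": 0.7, "home": 0.3},
    "blanca": {"white": 0.9, "blank": 0.1},
}

# Toy bigram language model P(word | previous word) -- invented numbers.
language_model = {
    ("<s>", "white"): 0.4, ("<s>", "blank"): 0.1,
    ("<s>", "house"): 0.2, ("<s>", "home"): 0.2,
    ("white", "house"): 0.5, ("white", "home"): 0.2,
    ("blank", "house"): 0.1, ("blank", "home"): 0.1,
    ("house", "white"): 0.05, ("home", "white"): 0.05,
    ("house", "blank"): 0.01, ("home", "blank"): 0.01,
}

def lm_score(words):
    """Log-probability of an English word sequence under the bigram model."""
    score, prev = 0.0, "<s>"
    for w in words:
        score += math.log(language_model.get((prev, w), 1e-6))
        prev = w
    return score

def decode(foreign):
    """Brute-force decoder: pick the English sentence e maximizing
    log P(e) + log P(f | e) over all word choices and orderings."""
    choices = [translation_model[f].items() for f in foreign]
    best, best_score = None, float("-inf")
    for combo in itertools.product(*choices):
        tm = sum(math.log(p) for _, p in combo)  # translation model score
        for perm in itertools.permutations([w for w, _ in combo]):
            score = tm + lm_score(perm)
            if score > best_score:
                best, best_score = " ".join(perm), score
    return best

print(decode(["casa", "blanca"]))  # -> white house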

There were some interesting questions in Q and A session:

  • [37:15] Have you applied any of the theories used in stock markets to language processing?
  • [38:08] Are you working on any tools to assist writers?
  • [39:50] How far off are you from automated translation without disfluencies?
  • [41:58] 1) Is the GOOG-411 service actually used to gather a huge corpus of spoken data? 2) Are there any advances on data other than text?
  • [43:50] Would the techniques you described in your talk work in speech-to-text processing?
  • [44:50] Will there be any services for fighting comment and form spam?
  • [46:00] Do you also take information like what links do users click into account when displaying search results?
  • [47:22] How do you measure the difference between someone finding something and someone being satisfied with what they found?
  • [49:23] When doing machine translation, how can you tell that you're not learning from a website which was already translated with another machine translation service?
  • [50:49] How do you account for the fact that one person uses slang and another does not, and does it affect your translation tools?
  • [51:40] Can you speak a little about methods in OCR (Optical Character Recognition)?

The question at 44:50 got me very interested. The person asked if Google was going to offer any services for fighting spam. Peter said that it was an interesting idea, but it was better to ask Matt Cutts.

Having a hacker's mindset, I started thinking: what if someone emailed their comments through Gmail? If a comment was spam, Gmail's spam system would detect it and label the message as spam. Otherwise the message would end up in the Inbox. All the messages in the Inbox could then be posted back to the website as good comments. If there were false positives, you could go through the spam folder and move the non-spam messages back to the Inbox. What do you think?
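The Gmail round-trip above is essentially outsourcing classification to a trained spam filter. The same idea can be approximated locally with a tiny naive Bayes classifier — a toy sketch, not Gmail's actual system; the training comments below are invented:

```python
import math
from collections import Counter

# Invented training comments -- a stand-in for a real labeled spam corpus.
spam_docs = ["buy cheap pills now", "cheap pills buy now", "free money click here"]
ham_docs = ["great talk thanks", "interesting point about data",
            "nice post about translation"]

spam_counts = Counter(w for d in spam_docs for w in d.split())
ham_counts = Counter(w for d in ham_docs for w in d.split())
vocab = set(spam_counts) | set(ham_counts)

def log_prob(counts, word):
    # Laplace smoothing so unseen words don't zero out the product.
    return math.log((counts[word] + 1) / (sum(counts.values()) + len(vocab)))

def is_spam(comment):
    """Label a comment spam if its words are likelier under the spam model."""
    words = comment.lower().split()
    spam_score = sum(log_prob(spam_counts, w) for w in words)
    ham_score = sum(log_prob(ham_counts, w) for w in words)
    return spam_score > ham_score

print(is_spam("buy cheap pills"))    # -> True
print(is_spam("great talk thanks"))  # -> False
```

With only six training comments this is a parlor trick, but it is the same word-statistics approach that, fed with Gmail-scale data, becomes a serious filter — which is exactly the "more data beats cleverer algorithms" point of the talk.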

Have fun!

Comments

oelewapperke
June 24, 2008, 22:59

People aren't even prepared to accept theoretical knowledge, let alone let computers do their theorizing (because this still generates theories, just opaque ones we can't identify).

What if you ask a data-based computer the question "are gays bad for society? How much less input can society expect from someone who is gay?"

What if the answer is "lots"?

And there are a million questions like this. Pouring money into AIDS research is nonsense; almost no one dies from the disease. It would make much more sense to invest in heart diagnosis and medication, cancer research, and a cheap way to cure pulmonary diseases, which are the real killers (WHO statistics @ http://www.who.int/mediacentre/factsheets/fs310/en/index.html):

  • Coronary heart disease: 7.21 million deaths
  • Stroke and other cerebrovascular diseases: 5.51 million deaths
  • Lower respiratory infections: 3.2 million deaths

And obviously following the data the US should immediately try to switch from oil to nuclear, and as completely as possible, and AFTER THAT (prioritizing) try to develop solar and fusion.

February 27, 2014, 14:57

Pattern recognition pattern recognizer. Gödel is laughing somewhere.
