<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>matpalm - Latest Comments</title><link xmlns="http://www.w3.org/2005/Atom" rel="http://api.friendfeed.com/2008/03#sup" href="http://disqus.com/sup/all.sup#forumcomments-b33586c3" type="application/json"/><link>http://matpalm.disqus.com/</link><description></description><atom:link href="http://matpalm.disqus.com/comments.rss" rel="self"></atom:link><language>en</language><lastBuildDate>Thu, 03 May 2012 05:27:54 -0000</lastBuildDate><item><title>Re: brain of mat kelcey</title><link>http://matpalm.com/blog/2010/06/25/friend-clustering-by-term-usage/#comment-517474964</link><description>&lt;p&gt;Of course you did not find any words in common. The WordBags API explicitly gives unique word usage per-user. Simple GIGO really&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Srikant Jakilinki</dc:creator><pubDate>Thu, 03 May 2012 05:27:54 -0000</pubDate></item><item><title>Re: brain of mat kelcey</title><link>http://matpalm.com/blog/2011/08/13/wikipedia-philosophy#comment-480547059</link><description>&lt;p&gt;Well, I think I just noticed the change as 15 minutes ago something that used to take me to philosophy is now taking me to truth... 
o.0 I think someone changed something.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">John</dc:creator><pubDate>Fri, 30 Mar 2012 05:15:02 -0000</pubDate></item><item><title>Re: brain of mat kelcey</title><link>http://matpalm.com/blog/2011/12/10/common_crawl_visible_text#comment-472732092</link><description>&lt;p&gt;Sorry Kiran I no longer have access to a large enough cluster to do this.&lt;br&gt;The closest I have is &lt;a href="http://bit.ly/GJgylB" rel="nofollow"&gt;http://bit.ly/GJgylB&lt;/a&gt; which is the highest frequency top level domains.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">mat kelcey</dc:creator><pubDate>Wed, 21 Mar 2012 16:08:49 -0000</pubDate></item><item><title>Re: brain of mat kelcey</title><link>http://matpalm.com/blog/2011/12/10/common_crawl_visible_text#comment-472700592</link><description>&lt;p&gt;Hi Matt,&lt;/p&gt;

&lt;p&gt;I would like to have all the domain names present in the common crawl. I have asked the admins at commoncrawl repeatedly for this without success.&lt;/p&gt;

&lt;p&gt;I would be thrilled if you could create a dump of this data and post it somewhere.&lt;/p&gt;

&lt;p&gt;Thanks in advance.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Kiran J</dc:creator><pubDate>Wed, 21 Mar 2012 15:38:03 -0000</pubDate></item><item><title>Re: sketching</title><link>http://matpalm.com/resemblance/sketching/#comment-466838334</link><description>&lt;p&gt;It would be nice if you updated your table with a context triggered piecewise hashing algorithm like ssdeep.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">lalakis</dc:creator><pubDate>Fri, 16 Mar 2012 07:26:47 -0000</pubDate></item><item><title>Re: brain of mat kelcey</title><link>http://matpalm.com/blog/2011/11/15/collocations_3#comment-456358546</link><description>&lt;p&gt;Nice pick up Tom, I did in fact have an error in some parsing which meant I was dropping numbers; stupid silly bug.&lt;/p&gt;

&lt;p&gt;I agree Wikipedia is a bit of a special corpus and lots of people don't appreciate that it's not fully representative of common language, though what is I guess...&lt;/p&gt;

&lt;p&gt;Thanks for the comment!&lt;/p&gt;

&lt;p&gt;Mat&lt;br&gt;&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">mat kelcey</dc:creator><pubDate>Sun, 04 Mar 2012 19:08:32 -0000</pubDate></item><item><title>Re: resemblance with the jaccard coefficient</title><link>http://matpalm.com/resemblance/jaccard_coeff/#comment-456041811</link><description>&lt;p&gt;Using the processor's POPCNT instruction would be much faster than your little Kernighan-inspired loop.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Sun, 04 Mar 2012 12:26:48 -0000</pubDate></item><item><title>Re: brain of mat kelcey</title><link>http://matpalm.com/blog/2011/11/15/collocations_3#comment-456027047</link><description>&lt;p&gt;Both of your longest phrase examples seem suspicious to me. Did you drop all numbers in your initial parse? Intentionally?&lt;/p&gt;

&lt;p&gt;A little Googling confirms my suspicions that these were actually of the form:&lt;/p&gt;

&lt;p&gt;United Nations Security Council Resolution 1699, adopted unanimously on August 8, 2006, after recalling &lt;br&gt;As of the census [1] of 2000, there were 9536 people, 3922 households, and 2517 families residing in the city.&lt;/p&gt;

&lt;p&gt;To answer your question about templates vs cut &amp;amp; paste, templates (infoboxes) are excluded from the body text, but this type of stylized pro forma structure is pretty common in Wikipedia.  Some of it's from cut &amp;amp; paste or a single author working on a series of related articles, but often it's a semi-formal convention adopted by a group of authors.&lt;/p&gt;

&lt;p&gt;There are a number of Wikipedia-isms that would be fascinating to study statistically if there was an equivalent corpus to compare against.  For example, I suspect the frequency of the word "notable" is much higher in Wikipedia than elsewhere because they've got a notability requirement for inclusion, so authors writing about marginal cases take pains to stress why their subject is "notable."&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Sun, 04 Mar 2012 12:08:07 -0000</pubDate></item><item><title>Re: sketching</title><link>http://matpalm.com/resemblance/sketching/#comment-451031435</link><description>&lt;p&gt;And one last question. Broder's algorithm proposes hashing each shingle multiple times. For example, with shingle1 = {(a,b,c),(e,f)}: following the sketching approach, do I compute H(shingle1) = dig1 and then hash dig1 again as H(dig1), or do I rehash shingle1 and choose the min value from all the digests of this shingle?&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">lalakis</dc:creator><pubDate>Tue, 28 Feb 2012 05:19:18 -0000</pubDate></item><item><title>Re: sketching</title><link>http://matpalm.com/resemblance/sketching/#comment-449960443</link><description>&lt;p&gt;i used a simple modulo hash based on some code from "introduction to algorithms". i've actually since found out it has a bug in it and it produces very poorly distributed values, i guess it worked well enough for this experiment :/ this was a few years ago now. i'm not sure you can use SHA hashing as the hash function needs to be seedable. when i run something like this now i use murmur hashing; seedable, fast and very well distributed. 
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">mat kelcey</dc:creator><pubDate>Mon, 27 Feb 2012 10:58:56 -0000</pubDate></item><item><title>Re: sketching</title><link>http://matpalm.com/resemblance/sketching/#comment-449944257</link><description>&lt;p&gt;Which hash algorithm do you use to hash the shingles multiple times and compute the sketch at the end? I see a lot of Rabin fingerprints in the literature, but why wouldn't a simple SHA-1 work?&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">lalakis</dc:creator><pubDate>Mon, 27 Feb 2012 10:33:16 -0000</pubDate></item><item><title>Re: simhash</title><link>http://www.matpalm.com/resemblance/simhash/#comment-446848560</link><description>&lt;p&gt;Besides the length of each n-gram, there is another property in constructing the features, which is the sliding window. Did you choose this randomly as well, or is there a rationale for it? E.g.: &lt;br&gt;{ "Bobs cafe", "cafe is", "is excellent", ... } &lt;br&gt;Clearly, in the token n-gram scheme where each piece of information is a word, the sliding window equals 1. Why didn't you choose, for example, token n-grams of length 3 with a sliding window of 2? Is there some logic there? Thank you.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">lalakis</dc:creator><pubDate>Thu, 23 Feb 2012 08:02:17 -0000</pubDate></item><item><title>Re: simhash</title><link>http://www.matpalm.com/resemblance/simhash/#comment-430980209</link><description>&lt;p&gt;Can't think why not. 
It's a larger hash size (SHAs start at 160 bits) which I saw improved accuracy but on the other hand SHAs aren't cheap to calculate.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">mat kelcey</dc:creator><pubDate>Mon, 06 Feb 2012 13:15:34 -0000</pubDate></item><item><title>Re: simhash</title><link>http://www.matpalm.com/resemblance/simhash/#comment-430974140</link><description>&lt;p&gt;for small documents where i've been interested in differences such as punctuation or character normalisation (ie is "Bobs cafe" the same as "Bob's café" ?) i've used character ngrams {"Bo", "ob", "bs", "s ", ... }&lt;/p&gt;

&lt;p&gt;for larger documents where i've been more interested in general content i've used token ngrams eg { "Bobs cafe", "cafe is", "is excellent", ... }&lt;/p&gt;

&lt;p&gt;varying the ngram length gives different results depending on the overlap of content and the overall document length. there is no magic number.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">mat kelcey</dc:creator><pubDate>Mon, 06 Feb 2012 13:08:53 -0000</pubDate></item><item><title>Re: simhash</title><link>http://www.matpalm.com/resemblance/simhash/#comment-430670575</link><description>&lt;p&gt;Also, can you give some more information about the way you construct the shingles? Is there any particular way to construct them, or do you just group words together, choosing only the size?&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">lalakis</dc:creator><pubDate>Mon, 06 Feb 2012 04:19:13 -0000</pubDate></item><item><title>Re: simhash</title><link>http://www.matpalm.com/resemblance/simhash/#comment-430668741</link><description>&lt;p&gt;Yes, it is clear. And can you use any kind of crypto-hash function like SHA? Or is there a special one with particular features?&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">lalakis</dc:creator><pubDate>Mon, 06 Feb 2012 04:12:01 -0000</pubDate></item><item><title>Re: simhash</title><link>http://www.matpalm.com/resemblance/simhash/#comment-430668046</link><description>&lt;p&gt;Yes, it is clear. And can you use any kind of crypto-hash function like SHA? Or is there a special one with particular features?&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">lalakis</dc:creator><pubDate>Mon, 06 Feb 2012 04:09:23 -0000</pubDate></item><item><title>Re: simhash</title><link>http://www.matpalm.com/resemblance/simhash/#comment-428711376</link><description>&lt;p&gt;It might also help to think of it as a form of dimensionality reduction where the hash functions are used to decide where documents are located in the lower dimensional space.&lt;/p&gt;

&lt;p&gt;The important thing is that the hash functions produce the same result for the same input.&lt;/p&gt;

&lt;p&gt;Some information is lost, in the sense that you can't go backwards from the hash values to the shingles that generated them, but that doesn't matter: all the comparisons are done with the hashed values, not the original shingles.&lt;/p&gt;

&lt;p&gt;Does this help clarify things a bit?&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">mat kelcey</dc:creator><pubDate>Fri, 03 Feb 2012 12:22:01 -0000</pubDate></item><item><title>Re: simhash</title><link>http://www.matpalm.com/resemblance/simhash/#comment-428697098</link><description>&lt;p&gt;I can't understand how simhash works with a good approximation in similarity checking since the initial shingles are hashed using a crypto hash function.&lt;br&gt;That means that there is no correlation between the produced digest and the input. ???&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">lalakis</dc:creator><pubDate>Fri, 03 Feb 2012 12:02:04 -0000</pubDate></item><item><title>Re: brain of mat kelcey</title><link>http://matpalm.com/blog/2011/12/10/common_crawl_visible_text#comment-405643467</link><description>&lt;p&gt;Hey,&lt;br&gt;I've been working out the best way to make something like&lt;br&gt;this publicly available,&lt;br&gt;I'll let you know when it's up.&lt;br&gt;Mat&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">mat kelcey</dc:creator><pubDate>Sun, 08 Jan 2012 18:58:48 -0000</pubDate></item><item><title>Re: brain of mat kelcey</title><link>http://matpalm.com/blog/2011/12/10/common_crawl_visible_text#comment-404898388</link><description>&lt;p&gt;Hi Mat, &lt;/p&gt;

&lt;p&gt;Can you share the 3TB results with us?&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Manikandan Paneerselvam</dc:creator><pubDate>Sun, 08 Jan 2012 06:42:33 -0000</pubDate></item><item><title>Re: brain of mat kelcey</title><link>http://matpalm.com/blog/2011/08/13/wikipedia-philosophy#comment-399178980</link><description>&lt;p&gt;I think you mean "The first link in Surface Water Sports is now Skurfing but wasn't at the time you wrote this blog post"&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">mat kelcey</dc:creator><pubDate>Mon, 02 Jan 2012 19:18:32 -0000</pubDate></item><item><title>Re: brain of mat kelcey</title><link>http://matpalm.com/blog/2011/08/13/wikipedia-philosophy#comment-399175615</link><description>&lt;p&gt;yes; not in parentheses, not in italics&lt;br&gt;&lt;a href="https://github.com/matpalm/wikipediaPhilosophy/blob/master/article_parser.py#L81" rel="nofollow"&gt;https://github.com/matpalm/wik...&lt;/a&gt;&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">mat kelcey</dc:creator><pubDate>Mon, 02 Jan 2012 19:10:41 -0000</pubDate></item><item><title>Re: brain of mat kelcey</title><link>http://matpalm.com/blog/2011/08/13/wikipedia-philosophy#comment-399174147</link><description>&lt;p&gt;The first link in Surface Water Sports is Skurfing.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">semininja</dc:creator><pubDate>Mon, 02 Jan 2012 19:07:50 -0000</pubDate></item><item><title>Re: brain of mat kelcey</title><link>http://matpalm.com/blog/2011/08/13/wikipedia-philosophy#comment-399170910</link><description>&lt;p&gt;The alt-text from this comic (&lt;a href="http://xkcd.com/903/" rel="nofollow"&gt;http://xkcd.com/903/&lt;/a&gt;) says: "[I]f you take any article, click on the first link in the article text not in parentheses or italics, and then repeat, you will eventually end up at 
'Philosophy'."&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">semininja</dc:creator><pubDate>Mon, 02 Jan 2012 19:01:43 -0000</pubDate></item></channel></rss>
