In [43]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gensim

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer,TfidfTransformer
import sklearn.feature_extraction.text as text 
from sklearn.linear_model import LogisticRegression
from sklearn import decomposition

from IPython.display import Image
from IPython.display import HTML
from IPython.core import display

%matplotlib inline

from pylab import rcParams

rcParams['figure.figsize'] = 15, 10
 
plt.style.use('fivethirtyeight')
In [6]:
# This line will hide code by default when the notebook is exported as HTML
display.display_html('''<script>jQuery(function() {if (jQuery("body.notebook_app").length == 0)
                { jQuery(".input_area").toggle(); jQuery(".prompt").toggle();}});</script>''', raw=True)
In [7]:
def output_columns(df,rounding=2):
    '''
    Renames columns from "python_naming" to "Output Names".
    The rename happens in place and the same DataFrame is returned.
    The rounding argument is currently unused.
    '''
    
    # Index.str gives vectorized string operations on the column labels
    names = df.columns.str.replace("_", " ").str.title()
    df.columns = list(names.str.replace('usv','USV',case=False))
    return df
In [10]:
posts = pd.read_csv('data/usv_posts_cleaned.csv',encoding='utf8',parse_dates=True)
posters = pd.read_csv("data/usv_posters_cleaned.csv",encoding="utf8")
In [11]:
posts.date_created = pd.to_datetime(posts.date_created)
posters['relation_to_USV'] = np.where((posters.ever_usver)&(posters.is_usver==False),'Former USVer','Civilian')
# .loc replaces the deprecated .ix indexer for boolean-mask assignment
posters.loc[posters.is_usver,'relation_to_USV'] = "Current USVer"

What is this?

This is an analysis of USV.com post data from inception to Feb 2015. For a few years, USV.com was my favorite place to hang out on the internet. There was once a section on the website called "Conversation". It looked like this:
[Image: Conversation]
Any user could submit a link to an article. The community would comment on the links and upvote them. Links were ranked by comments, upvotes and time submitted. This analysis started as a Hack Day project to understand who was contributing value to the community and which topics the community liked to discuss.

There are three sections:

  • Poster Segments: identifies poster segments, who is in them and how they compare against each other.
  • Post Patterns: words and topics that are popular with the community, and trends in posting time.
  • Poster Profiles: words and topics that specific posters prefer.


The data set of posts includes:

  • the poster's twitter handle
  • the post title
  • the user submitted description (body text)
  • the time of the post
  • who upvoted it
  • comments and upvotes received

This is all publicly scrapable, but was given to me during a Hack Day.

Below is an example of the post data set, sorted by comment count:

In [17]:
output_columns(posts.sort_values('comment_count',ascending=False)).head()
Out[17]:
Title Poster Date Created Upvotes Comment Count Voted Users Body Text Body Text Raw Body Text Clean
1193 Bitcoin As Protocol albertwenger 2013-10-31 16:00:51.061 42 216 [u'albertwenger', u'aweissman', u'ppearlman', ... We owe many of the innovations that we use eve... We owe many of the innovations that we use eve... We owe many of the innovations that we use eve...
2540 Winning on Trust nickgrossman 2013-12-24 07:15:10.488 21 88 [u'nickgrossman', u'aweissman', u'albertwenger... Thoughts on why trust will be central to #winn... Thoughts on why trust will be central to #winn... Thoughts on why trust will be central to #winn...
1807 Vote vs. Like vs. Favorite vs. INSERT VERB HERE falicon 2013-11-22 14:28:30.619 12 67 [u'falicon', u'nickgrossman', u'kidmercury', u... Throwing the question about what to call the u... Throwing the question about what to call the u... Throwing the question about what to call the u...
3045 Feedback wanted: new look front page at usv.com nickgrossman 2014-01-21 08:59:51.998 11 67 [u'nickgrossman', u'ron_miller', u'julien51', ... Hi Everyone --\nLast week we started experimen... Hi Everyone --\nLast week we started experimen... Hi Everyone --\nLast week we started experimen...
4178 To Dare is To Do billmcneely 2014-03-17 06:08:44.918 33 62 [u'billmcneely', u'annelibby', u'LonnyLot', u'... Yesterday I was watching a proper football mat... Yesterday I was watching a proper football mat... Yesterday I was watching a proper football mat...

Poster Segments


Top Posters

Here are the stats for the top 10 posters, ranked by the number of conversations their posts sparked:

In [18]:
poster_metrics = ['poster','post_count','mean_comments','conversations_sparked',
                  "percent_of_posts_with_comments",'relation_to_USV']
output_columns(posters.sort_values('conversations_sparked',ascending=False)[poster_metrics].head(10).round(2))
Out[18]:
Poster Post Count Mean Comments Conversations Sparked Percent Of Posts With Comments Relation To USV
257 aweissman 452 1.52 45 0.43 Current USVer
381 fredwilson 542 1.46 43 0.39 Current USVer
543 nickgrossman 295 2.19 37 0.43 Current USVer
700 wmougayar 394 1.09 30 0.31 Civilian
235 albertwenger 143 2.97 19 0.39 Current USVer
475 kidmercury 277 0.94 17 0.29 Civilian
567 pointsnfigures 574 0.47 16 0.17 Civilian
282 bwats 99 1.90 15 0.36 Former USVer
367 falicon 58 4.26 12 0.52 Civilian
450 jmonegro 178 0.87 11 0.32 Current USVer

All of the data on posters is calculated from the posts above. I also manually added whether each poster worked at USV, currently or formerly, as of Feb 2015, which is when I did this analysis. The only change since then is that Brittany, who worked at USV in Feb 2015, works at Lattice.vc as of May 2016.
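
The notebook does not show how the poster table was built, so the following is only a sketch of how per-poster metrics like these could be aggregated from a posts table. The toy data, the `conversation_cutoff` value and the `toy_posters` name are assumptions for illustration, not the original code.

```python
import pandas as pd

# Toy stand-in for the posts table; the values are made up for illustration.
toy_posts = pd.DataFrame({
    'poster': ['alice', 'alice', 'alice', 'bob', 'bob'],
    'comment_count': [0, 7, 2, 0, 0],
})

# Hypothetical cutoff: a post "sparks a conversation" above this many comments.
conversation_cutoff = 5

toy_posts['got_comments'] = toy_posts.comment_count > 0
toy_posts['sparked_conversation'] = toy_posts.comment_count > conversation_cutoff

# One row per poster: counts, averages and shares computed from that poster's posts.
grouped = toy_posts.groupby('poster')
toy_posters = pd.DataFrame({
    'post_count': grouped['comment_count'].count(),
    'mean_comments': grouped['comment_count'].mean(),
    'conversations_sparked': grouped['sparked_conversation'].sum(),
    'percent_of_posts_with_comments': grouped['got_comments'].mean(),
}).reset_index()

print(toy_posters)
```

Taking the mean of a boolean column gives the share of posts where it is true, which matches the 0-to-1 scale of the "Percent Of Posts With Comments" column.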

In [19]:
# Highest ranked non-USVer posters
output_columns(posters.loc[(posters.ever_usver==False)& 
                           (posters.conversations_sparked>1)].sort_values('conversations_sparked')).to_csv('Top Posters.csv',index=False)

Current and Former USVers

Here are the stats for all current USVers and alumni:

In [20]:
output_columns(posters.loc[posters.ever_usver,poster_metrics].sort_values(['relation_to_USV','post_count']))
Out[20]:
Poster Post Count Mean Comments Conversations Sparked Percent Of Posts With Comments Relation To USV
455 johnbuttrick 33 1.42 2 0.42 Current USVer
278 br_ttany 79 0.91 6 0.27 Current USVer
490 libovness 89 1.58 9 0.36 Current USVer
33 BradUSV 94 1.24 10 0.15 Current USVer
235 albertwenger 143 2.97 19 0.39 Current USVer
450 jmonegro 178 0.87 11 0.32 Current USVer
543 nickgrossman 295 2.19 37 0.43 Current USVer
257 aweissman 452 1.52 45 0.43 Current USVer
381 fredwilson 542 1.46 43 0.39 Current USVer
67 EricFriedman 5 0.00 0 0.00 Former USVer
292 ceonyc 5 0.00 0 0.00 Former USVer
296 christinacaci 17 2.53 3 0.47 Former USVer
385 garychou 18 4.00 3 0.50 Former USVer
11 AlexanderPease 49 1.53 3 0.39 Former USVer
282 bwats 99 1.90 15 0.36 Former USVer
In [21]:
# Create a csv of all USVers
output_columns(posters.loc[posters.ever_usver==True].sort_values('conversations_sparked',ascending=False)).to_csv("USVers by the Numbers.csv",index=False)

Infrequent, but High Value Posters

This is a segment of posters who post infrequently but get high engagement per post. This is a group that the community wants to hear more from. The cutoffs are:

  • Averages more than 2 comments per post
  • Posted between 5 and 15 times

These cutoffs, like all cutoffs, are somewhat arbitrary. They felt directionally correct to me.

In [22]:
output_columns(posters.loc[(posters.mean_comments>2)&
           (posters.post_count>=5)&
           (posters.post_count<=15),poster_metrics].sort_values('mean_comments'))

# Uncomment and move up to output as a CSV
# .to_csv('Occasional, but Valuable Posters.csv',index=False)
Out[22]:
Poster Post Count Mean Comments Conversations Sparked Percent Of Posts With Comments Relation To USV
27 BenedictEvans 13 2.08 2 0.38 Civilian
349 ebellity 10 2.20 2 0.60 Civilian
293 cezinho 6 2.33 1 0.50 Civilian
594 rrhoover 14 2.43 3 0.64 Civilian
283 bwertz 9 2.67 3 0.67 Civilian
559 patrickjmorris 11 2.73 2 0.45 Civilian
229 adsy_me 12 2.75 1 0.42 Civilian
496 manuelmolina 6 2.83 1 0.83 Civilian
673 tomcritchlow 12 2.92 3 0.58 Civilian
456 johnfazzolari 5 3.00 1 0.40 Civilian
77 GeoffreyWeg 10 3.30 3 0.50 Civilian
323 davewiner 7 3.57 2 0.71 Civilian
522 moot 10 3.70 4 0.80 Civilian
568 ppearlman 11 3.73 2 0.27 Civilian
599 ryaneshea 5 4.80 2 0.80 Civilian
696 whitneymcn 5 5.00 2 0.40 Civilian
135 MsPseudolus 15 6.33 4 0.60 Civilian
460 jordancooper 7 6.57 2 0.57 Civilian
In [23]:
posts = posts.merge(posters[['poster','is_usver','ever_usver']],on='poster',how='left')

USVers vs Civilians

This is the average number of conversations sparked by posters based on their relationship to USV:

In [24]:
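sns.barplot(x='relation_to_USV',y='conversations_sparked',data=posters, palette="Blues_d",estimator=np.mean,ci=None)
# sns.plt and sns.axlabel are not available in newer seaborn releases, so call matplotlib directly
plt.title("Current and Former USVers vs Civilians")
plt.xlabel("Relationship to USV")
plt.ylabel("Average Conversations Sparked")
plt.ylim(0,35)
# Separate filename so the max plot further down does not overwrite this figure
plt.savefig("USVers vs Non-USVers (Average).png", dpi=100)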

Current USVers tend to spark more conversations. Could that be because the civilian category is weighed down by inactive accounts?

Here is the number of current USVers, former USVers and Civilians who have posted:

In [25]:
posters.groupby('relation_to_USV').count()[['poster']].rename(columns={'poster':'Number of Unique Posters'})
Out[25]:
Number of Unique Posters
relation_to_USV
Civilian 699
Current USVer 9
Former USVer 6

There are a lot more Civilians, so perhaps it's not surprising that their average is low. Below is the same plot, except that it shows the maximum conversations sparked in each group instead of the average.

In [26]:
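sns.barplot(x='relation_to_USV',y='conversations_sparked',data=posters, palette="Blues_d",estimator=np.max,ci=None)
# sns.plt and sns.axlabel are not available in newer seaborn releases, so call matplotlib directly
plt.title("Current and Former USVers vs Civilians")
plt.xlabel("Relationship to USV")
plt.ylabel("Max Conversations Sparked in Each Group")
# The largest value in the poster table is 45, so a limit of 35 would clip the tallest bar
plt.ylim(0,50)
plt.savefig("USVers vs Non-USVers (Top Posts).png", dpi=100)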

As you can see, at least one Civilian has sparked a number of discussions comparable to the USVers. The average USVer does spark more conversations than the average Civilian, but the most active Civilians are comparable to USVers.
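
To illustrate why the mean and the max tell different stories here, below is a small sketch with made-up numbers (not the real poster data): a large group with many inactive members has a low mean and median, even though its most active member is in the same range as the small group.

```python
import pandas as pd

# Made-up data: six "Civilians" (mostly inactive) and two "Current USVers".
toy = pd.DataFrame({
    'relation': ['Civilian'] * 6 + ['Current USVer'] * 2,
    'conversations_sparked': [0, 0, 0, 0, 1, 30, 20, 40],
})

# Compare group size, mean, median and max side by side.
summary = toy.groupby('relation')['conversations_sparked'].agg(
    ['count', 'mean', 'median', 'max'])
print(summary)
```

The median is included because, for skewed groups like these, it describes the typical member better than the mean does.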

Post Patterns


Post Engagement by Time Posted

Does it matter what time of day a post is made?

In [27]:
# Select the numeric columns before averaging so text columns are never passed to mean()
posts.groupby(posts.date_created.dt.hour)[['upvotes','comment_count']].mean().plot(legend=True)
plt.ylim(0,5)
plt.title('USV.com -- Average Upvotes and Comments by Hour of the Day')
plt.xlabel('Hour of the Day')
plt.ylabel('Average Count')

There is no strong relationship between the hour a post is made and the number of upvotes and comments it receives.

This is what the same graph looks like for Product Hunt (data pulled from their API):

In [28]:
Image(filename='images/Counts-and-Upvotes-by-Hour-Product-Hunt.png') 
Out[28]:

Product Hunt has a strong pattern based on time of day. This is because Product Hunt resets every day. USV.com does not.

This could indicate an opportunity. If USV.com built in a predictable time cycle for new posts, it could lead to habitual visits, posts and discussions. Reddit and Hacker News show you don't need a daily leaderboard that restarts at midnight. However, people should expect to see new content at some regular interval. AVC does this very well.

In [29]:
##
# posts.set_index('date_created')['comment_count'].resample('M').count()[:-1].plot()
# That is the evolution of monthly posts over time. USV.com was primarily a blog for occasional 
# posts from 2006 - 2013. You can see when USV.com allowed anyone to submit links. By number of 
# posts, this peaked in 2013. However, I think number of posts is the wrong metric.

Posts The Community Likes To Discuss

First, we must define "likes to discuss". One way to define it is by number of comments. Here is the distribution of posts by the number of comments received:

In [30]:
posts.comment_count.loc[posts.comment_count>=0].hist(bins=250)
plt.xlabel('Number of Comments')
plt.ylabel('Number of Posts')
plt.title('Distribution of Comments')
# Set the x-axis limit before saving so the saved image matches the displayed plot
plt.xlim(0,50)
plt.savefig('Distribution of Comments.png')
print("Comment Count Skew: " + str(posts.comment_count.skew().round(2)))
Comment Count Skew: 24.05

As can be seen above, most posts get zero or very few comments. The skew is strongly positive (about 24). Let's define a popular post as one that sparks a discussion, meaning its comment count lands in roughly the top 5% of posts (about 5 comments or more).
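
As a sketch of the idea on made-up numbers (the `toy_comments` series is an assumption, not the real data), this is how a long right tail shows up as positive skew and how a 95th-percentile cutoff flags only the most discussed posts:

```python
import pandas as pd

# Made-up comment counts: most posts get nothing, one post gets a lot.
toy_comments = pd.Series([0] * 14 + [1, 1, 2, 3, 5, 40])

# Positive skew means the tail of the distribution stretches to the right.
print("Skew:", toy_comments.skew())

# Share of posts with no comments at all.
share_without_comments = (toy_comments == 0).mean()
print("Share with no comments:", share_without_comments)

# Flag posts strictly above the 95th percentile of comment counts.
cutoff = toy_comments.quantile(0.95)
sparked = toy_comments > cutoff
print("Posts above the cutoff:", int(sparked.sum()))
```

Note that a strict `>` comparison excludes posts sitting exactly at the cutoff, so "at least N comments" and "more than N comments" can pick slightly different sets of posts.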

In [31]:
# posts.comment_count.value_counts(normalize=True).round(4).sort_index()*100
# That is the distribution of comments. As you can see, 73% of posts got no comments. 90% got 3 or fewer comments. 
In [32]:
top5_cutoff = posts.comment_count.quantile(.95)
# print "The top 5%% of posts got at least %s comments so that will be the cutoff." %int(top5_cutoff)
In [33]:
posts['sparked_conversation'] = posts.comment_count>top5_cutoff
posts['got_comments'] = posts.comment_count>0
In [34]:
posts[['title','body_text']] = posts[['title','body_text']].fillna('')

Optional Explanation of Math

Now, I will use TF-IDF weighting of one- and two-word phrases that appear in at least 20 post titles. TF-IDF is just the word counts in each title, weighted by how rare the words are across all titles.
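
Here is a minimal sketch of that weighting on three made-up titles. It uses `TfidfVectorizer`, which combines the count and TF-IDF steps into one object; scikit-learn's default variant also smooths the IDF term and length-normalizes each row. The `toy_` names are mine.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up titles: "bitcoin" is in every title, the other words are rare.
toy_titles = [
    "bitcoin as protocol",
    "bitcoin price update",
    "bitcoin and trust",
]

toy_vec = TfidfVectorizer()
toy_tfidf = toy_vec.fit_transform(toy_titles)

# vocabulary_ maps each term to its column index in the matrix.
col = toy_vec.vocabulary_
first_title = toy_tfidf[0].toarray().ravel()

# Both words appear once in the first title, but the rarer word gets more weight.
print("bitcoin: ", first_title[col['bitcoin']])
print("protocol:", first_title[col['protocol']])
```

A word that shows up in every title carries little information about any single title, so it is weighted down relative to a word that appears only once across the collection.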

In [72]:
vec = CountVectorizer(ngram_range=(1,2),min_df=20,stop_words='english')
X_words = vec.fit_transform(posts.title)
In [73]:
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X_words)

Next, I fit a logistic regression model with L2 regularization. It uses the TF-IDF weights of the title phrases to predict whether or not a post will spark a conversation. Logistic regression is not the most predictive model, but it is interpretable, which is the goal here.

Note that the data set is small (relative to the number of features), which makes it hard to do proper cross validation. Instead, I am using regularization and requiring that a phrase appears 20+ times to reduce overfitting. Even so, I wouldn't try to make predictions with this model.
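
To show what regularization does in this kind of setup, here is a sketch on synthetic data (random features, not the title matrix). In scikit-learn, a smaller `C` means a stronger penalty, and the default penalty is L2. The data and the `norms` dictionary are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 200 "posts", 50 "phrases", and only the first feature matters.
rng = np.random.RandomState(0)
X_toy = rng.rand(200, 50)
y_toy = (X_toy[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)

# Fit the same model with strong, moderate and weak regularization.
norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=5000)  # default penalty is L2
    clf.fit(X_toy, y_toy)
    norms[C] = np.linalg.norm(clf.coef_[0])

# Stronger regularization (smaller C) pulls the coefficients toward zero.
for C in sorted(norms):
    print("C=%s  coefficient norm=%.3f" % (C, norms[C]))
```

With many features and few examples, shrinking the coefficients this way keeps the model from leaning too hard on phrases that only look predictive by chance.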

In [74]:
y = posts.sparked_conversation
model = LogisticRegression(penalty='l2')
# Assign the result so the notebook does not echo the fitted model's repr
_ = model.fit(tfidf,y)

In [75]:
# list() makes the pairs concrete, since zip returns a lazy iterator on Python 3
vocab = list(zip(vec.get_feature_names(),
                 model.coef_[0]))

df_vocab = pd.DataFrame(vocab)

Popular Title Words

The higher the coefficient, the more strongly a title word or phrase is associated with sparking a conversation. Here are the top 20 words and phrases:

In [39]:
df_vocab.columns = ['word','coef']
df_vocab.sort_values('coef',ascending=False).head(20)
Out[39]:
word coef
315 usv 2.772766
147 introducing 1.483714
20 ask usv 1.309939
264 self 1.247119
43 car 1.203155
53 code 1.181710
89 economy 1.093197
275 snapchat 1.079421
168 let 1.054378
80 did 1.043933
306 trust 1.012399
216 open 0.995027
328 vs 0.971555
44 case 0.968341
30 bitcoin 0.965480
64 crowdfunding 0.961925
270 silicon 0.959581
163 lead 0.949660
82 disruption 0.930395
92 employees 0.894577
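
To put the coefficients above on a more intuitive scale: in a logistic regression, each feature adds `coef * value` to the log-odds, so it multiplies the odds by `exp(coef * value)`. A minimal sketch, using the rounded `usv` coefficient from the table and hypothetical TF-IDF weights (the helper name `odds_multiplier` is mine):

```python
import math

def odds_multiplier(coef, feature_value):
    """Factor by which the odds change when a feature goes from 0 to feature_value."""
    return math.exp(coef * feature_value)

# Rounded coefficient for "usv" from the table above.
usv_coef = 2.77

# Hypothetical TF-IDF weights: the weight depends on the other words in the title.
for weight in (0.25, 0.5, 1.0):
    print("weight=%.2f  odds multiplied by %.2f" % (weight, odds_multiplier(usv_coef, weight)))
```

These multipliers describe association within this model, not causation, and they assume the other title words are held fixed.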

Topics on USV.com

Above we looked at the titles. Now, let's check out the post descriptions. By using non-negative matrix factorization, I tried to pull out topics.

The post descriptions have too few words for perfectly reliable topic models. Many topics were sensitive to hyperparameters. Even with all of the noise, there were still some topics that stood out.

Below are some topics and their top 10 words. I did not use tags to find these. This came solely from which words appear together in descriptions.
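
For readers new to the technique, here is a toy sketch of non-negative matrix factorization on six made-up descriptions. It factors the document-term matrix into a document-topic matrix (W) and a topic-term matrix (H), then reads the top words off each row of H. The documents, the two-topic setting and the `toy_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up descriptions drawn from two loose themes.
toy_docs = [
    "bitcoin blockchain currency",
    "bitcoin currency exchange",
    "blockchain bitcoin mining",
    "mobile app design",
    "mobile app devices",
    "app design mobile messaging",
]

toy_cv = TfidfVectorizer()
toy_matrix = toy_cv.fit_transform(toy_docs)

# Factor the matrix into two non-negative parts.
toy_nmf = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=1000)
doc_topic = toy_nmf.fit_transform(toy_matrix)  # W: one row per document
topic_term = toy_nmf.components_               # H: one row per topic

# List the terms in column order so row positions in H map back to words.
terms = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
for k, row in enumerate(topic_term):
    top = [terms[i] for i in np.argsort(row)[::-1][:3]]
    print("Topic %d: %s" % (k, ", ".join(top)))
```

On a corpus this tidy the topics usually separate cleanly. Real descriptions are short and noisy, which is why the number of topics took trial and error.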

In [40]:
posts['body_text_raw'] = posts.body_text
posts['body_text_clean'] = posts.body_text.str.replace(r'\[a-z][a-z][1-9]\[a-z][a-z][a-z]', '', case=False)
posts['body_text_clean'] = posts.body_text_clean.str.replace('\'', '', case=False)
posts['body_text_clean'] = posts.body_text_clean.str.replace('[a-z]/[a-z]', '', case=False)
In [44]:
custom_stopwords= ['looks','look','read','great','good','dont','really','done','kik','lets',
           'http','let','just','that','thats','like','lot','interesting','think','im',
           'thought','thoughts','id','love','twitter']

my_stop_words = text.ENGLISH_STOP_WORDS.union(custom_stopwords)
In [45]:
# This step performs the vectorization,
# tf-idf weighting, stop word removal, and normalization.
# The input is the cleaned post descriptions, one document per post.
cv = TfidfVectorizer(ngram_range=(1,1),max_df=0.6, min_df=4,stop_words=list(my_stop_words))
doc_term_matrix = cv.fit_transform(posts.body_text_clean)
 
# The tokens can be extracted as:
vocab = cv.get_feature_names()
In [46]:
#trial and error got me to 45
num_topics = 45
#doctopic is the W matrix
decomp = decomposition.NMF(n_components = num_topics, random_state=50,init = 'nndsvda')
doctopic = decomp.fit_transform(doc_term_matrix)
In [47]:
n_top_words = 10
topic_words = []
for topic in decomp.components_:
    idx = np.argsort(topic)[::-1][0:n_top_words]
    topic_words.append([vocab[i] for i in idx])
In [48]:
topic_names = [
    "Web Services", "Bitcoin and Blockchain","AVC or Continuations","Customer Success",
    "Mobile","USV Community","Startup Ecosystems","Data Privacy and Security",0,"Net Neutrality",1,"Long Read",2,
    "Venture Capital",3,"Tech Job Market","HTML Tags","Internt Access",4,"Markets","Test Post",
    "Big 4 Tech Co's","Linux & Cloud","Business Models","App Store",5,6,7,8,9,10,11,"iOS vs Android",12,
    "AVC Posts",13,14,"Community Feedback","Technology and Patents","Video","Startup Building",15,
    "Open Source","Product Development",16
              ]
In [49]:
# Prints every named topic and its top 10 words; integer placeholders mark unnamed topics
for name, words in zip(topic_names, topic_words):
    if not isinstance(name, int):
        print("Topic: %s" % name)
        print("Top words: " + ", ".join(words))
        print("")
Topic: Web Services
Top words: web, services, service, users, social, network, information, networks, content, media

Topic: Bitcoin and Blockchain
Top words: bitcoin, currency, blockchain, exchange, coinbase, money, mining, transactions, digital, value

Topic: AVC or Continuations
Top words: post, blog, alberts, freds, wrote, rand, comments, founder, earlier, news

Topic: Customer Success
Top words: customer, success, customers, gainsight, management, service, important, marketing, satisfaction, successful

Topic: Mobile
Top words: mobile, app, design, facebook, future, device, experience, devices, market, messaging

Topic: USV Community
Top words: usv, com, posting, cross, www, avc, mikecollett, fred, list, team

Topic: Startup Ecosystems
Top words: startup, ecosystem, build, life, hear, story, mistakes, founders, key, thinking

Topic: Data Privacy and Security
Top words: data, privacy, users, security, science, market, messenger, point, visualization, using

Topic: Net Neutrality
Top words: net, neutrality, fight, fcc, different, case, piece, marc, thing, end

Topic: Long Read
Top words: nice, essay, analysis, interview, kickstarter, amazon, piece, transparency, equity, write

Topic: Venture Capital
Top words: startups, invest, vcs, investors, advice, growth, businesses, founders, money, point

Topic: Tech Job Market
Top words: tech, job, nyc, women, culture, future, talks, innovation, ny, chicago

Topic: HTML Tags
Top words: div, class, story, review, section, media, social, gives, footer, image

Topic: Internt Access
Top words: internet, things, access, online, security, future, fcc, privacy, freedom, public

Topic: Markets
Top words: android, microsoft, apple, ios, windows, patent, operating, phone, end, market

Topic: Test Post
Top words: testing, security, usvs, real, conversation, simulations, platform, list, cloudflare, albert

Topic: Big 4 Tech Co's
Top words: google, search, results, facebook, https, apple, nsa, amazon, car, reading

Topic: Linux & Cloud
Top words: cloud, linux, red, hat, enterprise, openstack, latest, security, ubuntu, storage

Topic: Business Models
Top words: business, model, start, help, analysis, models, end, small, businesses, simple

Topic: App Store
Top words: apps, list, native, platform, social, different, million, users, developers, wonder

Topic: iOS vs Android
Top words: company, building, portfolio, team, start, things, culture, run, amazing, acquired

Topic: AVC Posts
Top words: today, talk, avc, wrote, amazing, amazon, share, brief, youtube, announced

Topic: Community Feedback
Top words: community, feedback, value, ownership, sharing, online, building, interested, equity, invite

Topic: Technology and Patents
Top words: technology, software, innovation, patent, patents, piece, disruptive, peer, impact, education

Topic: Video
Top words: video, content, interview, marketing, media, watch, music, youtube, week, creators

Topic: Startup Building
Top words: companies, capital, investors, venture, market, vc, investment, angel, investing, funding

Topic: Open Source
Top words: open, source, software, project, platform, foundation, yes, code, projects, networking

Topic: Product Development
Top words: product, building, products, amazing, hunt, team, design, important, growth, user

In [51]:
posts['clean_upvotes']= posts.voted_users.str.replace('^u|\'|\[|\]','')
users_votes = posts.clean_upvotes.str.get_dummies(",")
In [52]:
users_votes.columns = users_votes.columns.to_series().str.replace('^u','')
In [53]:
posts = pd.concat([posts,users_votes],axis=1)
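
The regex cleanup above relies on the exact spacing and `u'...'` prefixes in `voted_users`. As an alternative sketch, assuming the column holds Python list literals like the ones shown in the post table (the toy strings and the names `toy_voted`, `parsed` and `toy_votes` are mine), `ast.literal_eval` can parse each string before building the indicator columns:

```python
import ast

import pandas as pd

# Toy stand-in for posts.voted_users: each value is the text of a Python list.
toy_voted = pd.Series([
    "[u'albertwenger', u'aweissman']",
    "[u'aweissman']",
    "[]",
])

# Parse each string into a real list of usernames.
parsed = toy_voted.apply(ast.literal_eval)

# Join with a separator that cannot appear in a handle, then one-hot encode.
toy_votes = parsed.str.join('|').str.get_dummies('|')
print(toy_votes)
```

Parsing first avoids stray spaces or leftover prefix characters ending up in the column names, which is easy to miss when splitting the raw string on commas.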
In [54]:
model = LogisticRegression(penalty='l2')
y = posts.sparked_conversation
# Assign the result so the notebook does not echo the fitted model's repr
_ = model.fit(doctopic,y)

This is how well each topic predicts whether the community will discuss a post. This is a gross simplification: topic popularity should change over time, and different user segments probably have different preferences. Even so, it's quite interesting.

In [55]:
community_topics = pd.DataFrame([pd.Series(topic_names),model.coef_[0]]).T.sort_values(1,ascending=False)
community_topics.columns=["Topic","Coefficent"]
# Keep only the named topics; unnamed topics carry numeric placeholders
community_topics[community_topics.Topic.map(lambda x: isinstance(x, str))]
Out[55]:
Topic Coefficent
5 USV Community 2.79843
1 Bitcoin and Blockchain 2.09516
2 AVC or Continuations 2.0201
37 Community Feedback 1.0116
24 App Store 0.85171
4 Mobile 0.708065
17 Internt Access 0.5402
34 AVC Posts 0.470501
0 Web Services 0.452506
38 Technology and Patents 0.353975
43 Product Development 0.262379
42 Open Source 0.197744
21 Big 4 Tech Co's 0.191889
40 Startup Building 0.134794
39 Video 0.115053
13 Venture Capital 0.0783075
9 Net Neutrality 0.0504006
6 Startup Ecosystems 0.0124453
23 Business Models -0.121243
11 Long Read -0.123889
19 Markets -0.135989
32 iOS vs Android -0.159303
20 Test Post -0.198601
15 Tech Job Market -0.270777
7 Data Privacy and Security -0.282444
16 HTML Tags -0.289521
3 Customer Success -0.339316
22 Linux & Cloud -0.600329

Finding Posts on a Topic

We can use these topic models to find posts on a topic. For example, "Bitcoin and Blockchain" is a clear topic. These are the "most Bitcoin" posts, sorted by how strongly each post's description is associated with the Bitcoin topic. This could be used for a recommendation engine. However, not all topics came through as cleanly as Bitcoin.
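
As a rough sketch of the recommendation idea (not something this notebook implements), posts could be compared by the cosine similarity of their topic weights. The matrix below is made up, and `most_similar` is a hypothetical helper:

```python
import numpy as np

# Made-up document-topic weights: rows are posts, columns are topics.
toy_doctopic = np.array([
    [0.50, 0.00, 0.02],
    [0.45, 0.05, 0.00],
    [0.00, 0.40, 0.10],
    [0.02, 0.35, 0.00],
])

def most_similar(doc_topic, query_row, top_n=2):
    """Return the row indices of the posts closest in topic mix to query_row."""
    norms = np.linalg.norm(doc_topic, axis=1)
    norms[norms == 0] = 1.0                # avoid dividing by zero for empty rows
    unit = doc_topic / norms[:, None]
    sims = unit.dot(unit[query_row])       # cosine similarity to the query post
    sims[query_row] = -np.inf              # never recommend the post itself
    return list(np.argsort(sims)[::-1][:top_n])

print(most_similar(toy_doctopic, 0, top_n=1))
print(most_similar(toy_doctopic, 2, top_n=1))
```

Cosine similarity compares the mix of topics rather than their overall strength, so a short description and a long one on the same subject can still match.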

In [56]:
doctopic_df = pd.DataFrame(doctopic)
doctopic_df.columns = topic_names
posts_topics = pd.concat([posts,doctopic_df],axis=1)
In [57]:
topic = "Bitcoin and Blockchain"
posts_topics.sort_values( topic,ascending=False)[['poster','title',"body_text",topic]].head(20)
Out[57]:
poster title body_text Bitcoin and Blockchain
3683 TomLabus Nakamoto Bitcoin Defense In Bitcoin 0.531392
4210 pointsnfigures Burger King to Add Mobile Payments on Cell Phones precursor to Bitcoin Burgers? 0.531392
2889 fredwilson A VC: Bitcoin - Getting Past Store Of Value an... Some thoughts on where we go next with bitcoin 0.531392
1723 kidmercury Bitcoin From Over $900 to Under $540 in Less t... #bitcoin #hft 0.531392
5349 kidmercury Bots were responsible for bitcoin’s stratosphe... The problems of bitcoin anarchy 0.338237
7762 kidmercury Why Bitcoin Matters (Mini-Documentary) a good dose of bitcoin propaganda is just what... 0.331816
4579 wmougayar Search Engine DuckDuckGo Integrates Bitcoin Pr... DDG goes Bitcoin 0.329520
248 fredwilson Coinbase We have been thinking about and looking to mak... 0.309345
5532 N_Clemmons Coinsis: Bitcoin Credit card [DEMO VIDEO] Send Bitcoin by email\nPay with a bitcoin cred... 0.299673
1303 wmougayar You can now buy a car with Bitcoin in Australia 2 weeks ago, it was "you can buy a beer in Ams... 0.296406
7321 albertwenger BitQuest - The first minecraft server with bit... Interesting use case for Bitcoin 0.293877
6733 pointsnfigures If You Use Facebook, Yelp, Reddit, You Should ... interesting thought about how to use bitcoin b... 0.289755
4937 fredwilson The Pied Piper Effect – AVC Some thoughts on MIT and Bitcoin 0.287514
1477 aweissman Twitter / marcprecipice: Corner bodega, Brookl... 10% off if you pay with Bitcoin 0.285954
3385 pointsnfigures Bitcoin: Store of Value? is bitcoin a store of value? 0.283853
5975 EllieAsksWhy Just a Little Bit More Bitcoin Trouble There has been so much tumult in the bitcoin a... 0.278721
4836 christinacaci Why in Satoshi's name would you want a bitcoin? It seemed bitcoin could stand to be a little m... 0.278001
5215 wmougayar What Block Chain Analysis Tells Us About Bitcoin Includes some interesting graphs on bitcoin de... 0.274111
3563 pointsnfigures Good Sign For Future of Bitcoin 10% of all porn paid for with Bitcoin. 0.272462
1345 nickgrossman DarkWallet Aims To Be The Anarchist's Bitcoin ... DarkWallet is an effort to further anonymize b... 0.269487

Poster Profiles


Here I use the title and topics to predict if a poster will upvote or share something.

That creates a profile of topics and title words a poster likely finds interesting.

Here are some examples:

In [58]:
# Remake the tfidf word matrix with a threshold of 10 instead of 20 counts. 
vec_user_profiles = TfidfVectorizer(ngram_range=(1,1),min_df=10,stop_words='english')
X_words = vec_user_profiles.fit_transform(posts.title)
In [61]:
def get_user_profile(user):
    '''
    Prints the topics and title words that best predict an upvote from this user.
    '''
    y = users_votes[user]

    # Topic model: which description topics predict an upvote from this user
    model_topics = LogisticRegression(penalty='l2')
    model_topics.fit(doctopic,y)
    
    vocab_topics = list(zip(topic_names,model_topics.coef_[0]))
    df_vocab_topics = pd.DataFrame(vocab_topics)
    df_vocab_topics.columns = ['topic','coef']
    # Keep positive coefficients and drop the unnamed (integer placeholder) topics
    df_vocab_topics = df_vocab_topics[(df_vocab_topics.coef>0)&
                                      (df_vocab_topics.topic.map(lambda x: type(x))!=int)]
    
    # Word model: which title words predict an upvote from this user
    model_words = LogisticRegression(penalty='l2')
    model_words.fit(X_words,y)
    
    vocab = list(zip(vec_user_profiles.get_feature_names(),model_words.coef_[0]))

    df_vocab = pd.DataFrame(vocab)
    df_vocab.columns = ['word','coef']
    df_vocab = df_vocab[df_vocab.coef>0]
    
    print(user)
    print("")
    print("Favorite Topics: " +
          ", ".join(df_vocab_topics.sort_values('coef',ascending=False).head(10)['topic']))
    print("")
    print("Favorite Words: " +
          ", ".join(df_vocab.sort_values('coef',ascending=False).head(25)['word']))
    return None
In [62]:
get_user_profile('nickgrossman')
nickgrossman

Favorite Topics: Net Neutrality, Internt Access, Data Privacy and Security, USV Community, AVC Posts, Video, Test Post, Open Source, HTML Tags, Markets

Favorite Words: nytimes, comcast, fcc, techdirt, hunch, ignore, test, grossman, neutrality, nick, slow, internet, obama, privacy, anti, washington, health, verizon, surveillance, cities, uber, free, snowden, trust, policy
In [63]:
get_user_profile('albertwenger')
albertwenger

Favorite Topics: Open Source, Web Services, Technology and Patents, Startup Building, Internt Access, Net Neutrality, AVC Posts, Data Privacy and Security, AVC or Continuations, Mobile

Favorite Words: continuations, hiring, foursquare, com, wired, update, human, medium, wattpad, revolution, longer, hunt, org, brand, income, computer, age, 500, network, analysis, notes, mit, marketplaces, news, platform
In [64]:
get_user_profile('BenedictEvans')
BenedictEvans

Favorite Topics: Markets, Mobile, Internt Access, Tech Job Market, Product Development, Big 4 Tech Co's

Favorite Words: android, benedict, price, really, mobile, twitter, instagram, self, iphone, youtube, use, scale, evans, dead, facebook, foundation, value, technology, networks, does, platform, social, google
In [65]:
get_user_profile('fredwilson')
fredwilson

Favorite Topics: AVC or Continuations, AVC Posts, USV Community, Startup Building, Bitcoin and Blockchain, iOS vs Android, Long Read, Web Services, Business Models, Net Neutrality

Favorite Words: avc, vc, cliche, techcrunch, panel, continuations, coinbase, week, kik, kickstarter, wsj, talk, friday, blog, albert, evans, twitter, benedict, foursquare, looking, thoughts, capitalism, sessions, atlantic, duckduckgo
In [66]:
get_user_profile('aweissman')
aweissman

Favorite Topics: Long Read, Data Privacy and Security, App Store, Big 4 Tech Co's, Open Source, Internt Access, Startup Building, Mobile, USV Community

Favorite Words: music, yorker, medium, youtube, blog, angellist, com, peer, banks, app, evans, circle, ben, fiber, code, life, indie, funding, ve, beat, billion, bloomberg, free, economist, watch
In [67]:
get_user_profile('jmonegro')
print("")
print("The words worked much better than the topics for Joel")
jmonegro

Favorite Topics: 

Favorite Words: decentralized, blockchain, plan, distributed, wireless, communication, secure, messaging, based, market, 000, paypal, firm, paid, devices, bitcoin, post, stock, hard, email, change, sharing, trends, time, washington

The words worked much better than the topics for Joel
In [68]:
get_user_profile('pointsnfigures')
pointsnfigures

Favorite Topics: Bitcoin and Blockchain, Tech Job Market, Startup Ecosystems, Venture Capital

Favorite Words: chicago, drones, entrepreneurship, tech, good, bitcoin, football, women, robots, corporate, trading, robot, life, competition, federal, old, big, drone, act, make, invest, angel, social, times, wrong
In [69]:
get_user_profile('kidmercury')
kidmercury

Favorite Topics: Bitcoin and Blockchain, Big 4 Tech Co's, Technology and Patents, Internt Access, Community Feedback, Net Neutrality, Video, Markets, Data Privacy and Security, Mobile

Favorite Words: verge, hedge, techcrunch, com, news, amazon, zero, insider, state, bitcoin, technology, disruptive, nsa, currency, code, bank, man, google, says, computers, wants, samsung, china, bloomberg, wsj