The Potential of Recommender Systems That Use Customer Review Data

Data Science

2017.12.7

Introduction

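Online retailers such as Amazon and Rakuten have become an indispensable part of modern life. E-commerce services like these commonly use a "recommender system" to pursue two goals at once: higher customer satisfaction and higher sales.
Recommender systems can be divided into the following two types according to the technique they use.

  1. "Collaborative filtering", which uses information about customers with similar purchase patterns
  2. "Content-based filtering", which uses information about the products themselves

In this post, we use product review data and topic modeling to examine the relationships between products, and explore whether the approach could be applied to a content-based recommender system.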
 

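What you need

  1. Runtime environment
    1. Python 3.x (using Jupyter notebook)
    2. Python libraries
      1. Pandas
      2. NumPy
      3. NLTK
      4. Gensim
      5. Requests
      6. lxml (imported in the code below to parse product pages)
  2. Data
    1. Amazon Reviews data from the Stanford Network Analysis Project (SNAP)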

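There are two versions of the Amazon Reviews data (old and new); we used the newer one. Because obtaining the full dataset requires a request, the experiment uses a sample that can be downloaded directly from a link (the food category data under the "Small" subsets for experimentation section). Although it is only a sample, it contains about 150,000 reviews.
Each record has the following fields.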

  • asin: string – product id
  • helpful: array[long]
  • overall: double
  • reviewText: string
  • reviewTime: string
  • reviewerID: string
  • reviewerName: string
  • summary: string
  • unixReviewTime: 10 digits
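
For orientation, a single record in this format might look like the one below. This is only a sketch: the field names come from the list above, but every value is invented for illustration and is not taken from the dataset.

```python
import json

# One hypothetical review record in JSON Lines form (all values are made up).
line = ('{"asin": "B000000000", "helpful": [2, 3], "overall": 5.0, '
        '"reviewText": "Tasty and easy to prepare.", '
        '"reviewTime": "01 1, 2014", "reviewerID": "A0000000000000", '
        '"reviewerName": "Example User", "summary": "Great", '
        '"unixReviewTime": 1388534400}')

# Each line of the file parses independently into one dictionary.
record = json.loads(line)
print(sorted(record))
print(record["asin"], record["overall"])
```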

 

Estimating similar products with review data and topic modeling

Preparation

Import the libraries needed for text processing and topic modeling. The requests library is used to fetch product information from Amazon.

# Logging setup
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logger = logging.getLogger('main')
logger.setLevel(logging.INFO)
# Core libraries
import sys
import os
import gzip
import time
import shutil
import pandas as pd
from nltk import sent_tokenize, word_tokenize, WordNetLemmatizer
from nltk.corpus import stopwords
from gensim import models, corpora, similarities
# Libraries needed to visualize the results
import requests
from IPython.display import display, Markdown, Latex
from lxml import html

 

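Loading the raw data

The data is stored in JSON format, so it is easy to load with Pandas. The file is in JSON Lines format, with one JSON object per line (each line is one review), so remember to pass the lines=True option when reading it.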

data_path = '/data/amazon-product-reviews/reviews_Grocery_and_Gourmet_Food_5.json.gz'
# JSON Lines format: each line is a single JSON object
with gzip.open(data_path, 'rt') as f:
    amz_food = pd.read_json(f, lines=True)

Once loaded, the data is a DataFrame with one row per review and the columns listed earlier.
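
As a small, self-contained illustration of the lines=True option, the sketch below reads two records from an in-memory string instead of the gzip file. The records are hypothetical and carry only a few of the real fields.

```python
import io
import pandas as pd

# Two hypothetical reviews, one JSON object per line (values are made up).
jsonl = (
    '{"asin": "A1", "overall": 5.0, "reviewText": "Great", "unixReviewTime": 1400000000}\n'
    '{"asin": "A2", "overall": 3.0, "reviewText": "Okay", "unixReviewTime": 1400000100}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object.
toy = pd.read_json(io.StringIO(jsonl), lines=True)
print(toy.shape)
print(toy.columns.tolist())
```

Without lines=True, pandas would try to parse the whole input as a single JSON document and fail on the second line.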

 

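How products are represented

Each product is represented as the set of reviews that customers have posted for it. However, the number of reviews varies from product to product, and some products have poor ratings. Both are likely to hurt the accuracy of the recommendations, which is the end goal. We therefore filter the data using two conditions.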

  1. Use reviews rated 4 or higher
  2. Use the 10 most recent reviews (products with fewer than 10 are not used)

def get_latest_n_reviews(df, topn=10):
    # Row labels of the topn most recent reviews
    latest_idx = df['unixReviewTime'].nlargest(topn).index
    # Return those reviews as a list of texts
    return df.loc[latest_idx, 'reviewText'].tolist()
def build_product_documents(df, score_threshold=4.0, topn=10):
    logger.info('#records in original df: %d', len(df))
    # Find products with at least topn reviews rated score_threshold or higher
    review_cnt = df[df['overall'] >= score_threshold].groupby('asin').size()
    target_asins = set(review_cnt[review_cnt >= topn].index)
    # Note: this keeps every review of the selected products, so the latest
    # reviews may include ones rated below score_threshold
    df_filtered = df[df['asin'].isin(target_asins)]
    logger.info('#records in filtered df: %d', len(df_filtered))
    # Build an ['asin', 'reviews'] DataFrame
    res = df_filtered\
              .groupby('asin').apply(get_latest_n_reviews, topn=topn)\
              .to_frame('reviews').reset_index()
    return res
# Extract each ASIN with its latest reviews
df_reviews = build_product_documents(amz_food, 4.0, 10)

The resulting df_reviews DataFrame has two columns: asin, a string column holding the product ID, and reviews, an array column holding 10 reviews.
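
To make the filtering logic easier to follow, here is a minimal sketch on a toy DataFrame. The data, the helper name latest_reviews, and the small thresholds are all assumptions chosen for illustration. As in the code above, products are selected by their number of high-rated reviews, while the latest reviews are then drawn from all reviews of the selected product.

```python
import pandas as pd

# Hypothetical toy data: product "A" has three high-rated reviews, "B" only one.
toy = pd.DataFrame({
    "asin": ["A", "A", "A", "A", "B", "B"],
    "overall": [5.0, 4.0, 2.0, 5.0, 5.0, 1.0],
    "unixReviewTime": [10, 20, 30, 40, 10, 20],
    "reviewText": ["a1", "a2", "a3", "a4", "b1", "b2"],
})

def latest_reviews(group, topn):
    # Sort newest first and keep the text of the topn most recent reviews.
    newest = group.sort_values("unixReviewTime", ascending=False).head(topn)
    return newest["reviewText"].tolist()

score_threshold, topn = 4.0, 2

# Count high-rated reviews per product and keep products with at least topn.
high = toy[toy["overall"] >= score_threshold]
counts = high.groupby("asin").size()
keep = counts[counts >= topn].index

# Build one "document" (list of review texts) per remaining product.
docs = {
    asin: latest_reviews(group, topn)
    for asin, group in toy[toy["asin"].isin(keep)].groupby("asin")
}
print(docs)
```

Product "B" is dropped because it has only one high-rated review. Note that the low-rated review "a3" still ends up in the document for "A"; filtering the reviews themselves by rating before taking the latest ones would be a stricter variant.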

 

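Topic modeling with LDA

Next, we use topic modeling to compute the similarity between products, each represented by its reviews. We use Gensim, a widely used library for this purpose.
First, split the review text into individual words.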

def reviews2tokens(reviews, wnl, wn_stopwords):
    tokens = []
    for review in reviews:
        for sent in sent_tokenize(review):
            for word in word_tokenize(sent):
                # Lowercase
                word = word.lower()
                # Drop stop words
                if word in wn_stopwords:
                    continue
                # Use the lemma
                word = wnl.lemmatize(word)
                tokens.append(word)
    return tokens
# Split the review text into words, excluding stop words
# (the NLTK punkt, stopwords, and wordnet data must be downloaded beforehand)
wnl = WordNetLemmatizer()
wn_stopwords = set(stopwords.words('english'))
df_reviews['review_tokens'] = df_reviews['reviews'].apply(
    lambda doc: reviews2tokens(doc, wnl, wn_stopwords))
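
The shape of this step can be seen in a simplified stand-in that needs no NLTK data. This sketch swaps in a regular expression for NLTK's tokenizers, uses a tiny hand-picked stop-word set instead of NLTK's English list, and skips lemmatization entirely, so its output will differ from the real pipeline.

```python
import re

# A tiny hand-picked stop-word set for illustration (not NLTK's list).
STOPWORDS = {"the", "a", "is", "and", "it", "this"}

def simple_tokens(reviews):
    tokens = []
    for review in reviews:
        # Lowercase, then pull out runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", review.lower()):
            if word not in STOPWORDS:
                tokens.append(word)
    return tokens

tokens = simple_tokens(["This oatmeal is tasty.", "The flavor is mild and nice."])
print(tokens)
```

All reviews of a product are flattened into one token list, which is what lets a product be treated as a single document in the following steps.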

Next, convert each word sequence into a sequence of (word ID, frequency) pairs. This becomes the training data for the topic model.

token_dic = corpora.Dictionary()
token_dic.add_documents(df_reviews['review_tokens'])
df_reviews['bow'] = df_reviews['review_tokens'].apply(
    lambda tokens: token_dic.doc2bow(tokens))
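
The (word ID, frequency) representation can be illustrated in plain Python. This is a sketch of the idea rather than Gensim's implementation: the names token2id and to_bow are hypothetical, and IDs are assigned here in order of first appearance, which need not match the IDs Gensim assigns.

```python
from collections import Counter

# Two hypothetical tokenized documents.
docs = [["oatmeal", "gluten", "oatmeal"], ["pasta", "gluten"]]

# Assign an integer ID to each distinct token in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(tokens):
    # Count occurrences and return sorted (token_id, count) pairs.
    counts = Counter(token2id[t] for t in tokens)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(token2id)
print(bows)
```

Word order is discarded in this representation; only how often each word occurs in a document is kept.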

Finally, train the topic model. To keep the results reproducible, we fixed the random_state parameter, which is used when the model is initialized.

num_topics    = 30 # number of topics
chunksize     = 100 # batch size
passes        = 5 # number of passes over the data
random_state  = 77777 # fixed random seed for reproducibility
workers       = 8 # parallelism: 8 workers
m_lda = models.LdaMulticore(
    workers=workers,
    corpus=df_reviews['bow'].tolist(),
    id2word=token_dic,
    num_topics=num_topics,
    chunksize=chunksize,
    passes=passes,
    random_state=random_state)

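Searching for and comparing similar products

Which products have reviews similar to those of a given product? To check, we picked one product (a gluten-free oatmeal) and looked at the 10 products most similar to it.
We defined the following function to fetch the actual product information from Amazon.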

def markdown_output(query, sim, printReview=True):
    """Visualization function."""
    product_url = 'http://www.amazon.com/dp/' + query.asin
    header = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}
    # Generate information to print.
    max_retries = 10
    current_retry = 0
    b_success = False
    htmlbody = []
    while current_retry <= max_retries:
        # Send the browser-like user agent defined above with the request.
        response = requests.get(product_url, headers=header)
        if response.status_code == 200:
            node = html.fromstring(response.content)
            product_name = node.xpath('//span[@id="productTitle"]')[0].text_content().strip()
            landingImage = node.xpath('//img[@id="landingImage"]')[0].get('src')
            htmlbody = ['* ASIN: ' + query.asin,
                        ' * Similarity to the query product: ' + str(sim),
                        ' * Product Name: ' + product_name,
                        ' * Product URL: ' + product_url,
                        ' * Amazon product image: <img src="{}" style="height: 400px;" />'.format(landingImage)]
            if printReview:
                for idx, review in enumerate(query.reviews):
                    htmlbody.append(' * Review %d: %s' % (idx, review))
            b_success = True
            break
        else:
            # Wait and retry when the page could not be fetched.
            current_retry += 1
            logger.info('Retrying to get Amazon product information: %s, %d',
                         query.asin, current_retry)
            time.sleep(5)
    if not b_success:
        htmlbody = ['* ASIN: ' + query.asin,
                    ' * Failed to find the product from www.amazon.com/dp/' + query.asin]
    # Show it in Markdown.
    display(Markdown("\n".join(htmlbody)))
We chose "Chex Gluten Free Oatmeal Variety Pack, 8.84 Ounce" as the target product for the experiment.

qidx = df_reviews.index[-1]
raw_query = df_reviews.iloc[qidx]
markdown_output(raw_query, 1.0)


 

  • Review 0: I love oatmeal, and I’m always up to trying a new brand or variety. So when this Gluten Free Chex Oatmeal was offered on Vine I selected it right away. I’m glad I did. And in the process I learned something, which is always a good thing. First of all, when I saw the name of the product, “Gluten Free Chex Oatmeal,” I thought, “Sound like marketing hype to me. Isn’t all oatmeal gluten free?” Well, instead of just writing up a snarky comment along those lines I went out and did some research. It turns out that, no, all oatmeal is not gluten free. Yes, oats themselves are gluten-free, but apparently most oatmeal contains oats that have been cross-contaminated with a tiny bit of wheat, barley and/or rye. (Thanks, Google.)So, let’s try it out. This is a very good oatmeal without a lot of added nonsense. It doesn’t even have added vitamins and minerals that most cereals have nowadays. There are three flavors in this variety pack: apple cinnamon, maple brown sugar, and original. The original tastes like, well, plain oatmeal. If you like plain oatmeal you’ll love this. The nice oat flavor with no off flavors or strange aftertaste. The same goes for the flavored varieties. The apple cinnamon had discernible chunks of rehydrated apples as well as a nice aroma and flavor. Maple brown sugar was equally tasty.I prefer a thicker porridge, so the recommended amount of water per packet, two thirds of a cup, was about double what I would normally use. The instructions even say you can use an empty pouch to measure the COLD water for pouring it into a microwave-safe cup. For my preparations, I went a simpler route: I poured two packets into aSistema Soup Mugand added 6 ounces of water from my Keurig machine. It came out perfect. I liked it so much I hogged it all to myself, and didn’t even offer Mrs. Boilermate any.Bottom LineThis oatmeal is what theNature Valley OatmealI recently tried should have been. 
It has simple ingredients, no artificial anything, is easy to prepare, and is delicious. I would definitely purchase this product. Even if you don’t need to maintain a gluten-free diet, this oatmeal deserves a spot in your pantry.
  • Review 1: My daughter and husband are both gluten free and they love this oatmeal. They have both had struggles with cross contaminated oats, and it’s so nice to have a gluten free option that doesn’t cost an arm and a leg. We’ve already made the switch to this product. I would love to see more flavor varieties.
  • Review 2: I don’t usually eat quick-cooking oats, but sometimes it’s nice to keep a few packs of oatmeal in my desk or cupboard for a quick snack. The pouches of oatmeal in this multi-pack are no where near big enough for a full meal, but they certainly can serve as a nice snack.I loved the apple cinnamon oatmeal, which had real chunks of dehydrated apples. It was tasty and not too sweet. The maple syrup oatmeal, on the other hand, was sickly sweet, and I had a hard time getting it down. The plain oatmeal is just that- plain; it’s pretty hard to mess up plain oatmeal. I love how few ingredients are used to make all three flavors. Oatmeal is a classic food that really doesn’t need need a lot of added flavors or preservatives.This oatmeal is easy enough to cook in the microwave, and it turned out good for me, but I did need to cook it for slightly more than the 2 minutes suggested on the box. The instructions say you can add 2/3 cup of water or fill up the oatmeal pouch to a certain line. Well, you should definitely measure the water because depending on how tightly you are holding the pouch open or closed, there is a huge amount of variation in how much water it will take to reach the line.
  • Review 3: If you need to avoid gluten in your diet (or simply enjoy partaking of the latest health food fads), you might like this Chex Gluten-Free Oatmeal.This variety pack includes three flavors: brown sugar/cinnamon, cinnamon apple, and plain. All three taste very good.My only minor criticism involves the soupy consistency. This gluten-free oatmeal doesn’t seem as thick as regular oatmeal.I found that I had to cook the Chex oatmeal at least a minute longer than instructed to get the usual goopy consistency of oatmeal.Similarly I’m not sure if this product is quite as filling as traditional oatmeal. I was eating it as an evening snack — and one bowlful definitely left me wanting more.So I have deducted one star for the thin texture and what I am guessing will be a hefty price tag on the product.But those who insist on avoiding gluten may find these shortcomings of little consequence when the time comes to forage for porridge.PROS: Gluten-free Great taste Easy to prepareCONS: Thin texture Not as filling as regular oatmeal Expected high cost relative to traditional instant oatmeal
  • Review 4: Surprisingly delicious for instant oatmeal. When I have the time I really like slow cooked steel cut oats, but in a pinch these are a suitable substitute.
  • Review 5: Unlike most GF breakfast products, this isn’t a different texture than it’s gluten-y counterparts. It’s super delicious and tastes just like the other flavored oatmeals I’ve had before being diagnosed with Celiac Disease. I hope to see this in grocery stores VERY soon; I’d love to make it a staple in our pantry! Just like other oatmeals, you can make it in the microwave or with boiling milk or water. We used rice milk for a bowl and it was super creamy and amazing. We will definitely eat this again.
  • Review 6: My daughter thinks it tastes just like the regular quaker oatmeal. It is just as easy to make also. Now we just need from Chex to make a package with only maple flavor.
  • Review 7: Delicious gluten-free oatmeal: we tried both the regular (plain) and the flavored. (Apple Cinnamon), both were hearty, delicious, easy to make and provided a great serving size! The best part is that it’s gluten free, so that should mean it’s not cross contaminated in either the processing plant or the field (where it grows)…I’m not especially gluten allergic, only gluten sensitive and no issues with this oatmeal. It was yummy, decent calories, filling and reasonably priced, would definitely recommend!
  • Review 8: I like to eat oatmeal i the mornings. I usually buy Quaker Oats or Wegman’s house brand. I found that the cost was a lot more since I pay about $4 for a box of ten packets. This is almost double for only 6 packets.The taste was good as it compares to the two brands that I normally buy. The claim of heartier is suspect as the amount of oatmeal in the gluten free chex oatmeal is slightly more.I would only recommend this product if you are gluten intolerant otherwise purchase any other that is cheaper.
  • Review 9: Usually the label “gluten free” is a code word for “tastes like cardboard.” Fortunately, this is not the case for this delicious warm cereal.Because oatmeal is naturally gluten free, the label here is more a marketing gimmick. That being said, this is a very good instant oatmeal. It comes in 3 flavors – original, maple brown sugar, and apple cinnamon. True to its description, the oatmeal cooks very quickly and easily. You can measure in your water or milk, or Chex helpfully, but a mark that lets you use the pouch to measure the liquid.The flavors are mild, with a small amount of sugar. Enough for me, but it allows you to add more if you want.

 
Searching for highly similar products then returns the following.

# Preprocessing for the similarity computation
all_topic_dists = m_lda[df_reviews['bow'].tolist()]
# The feature dimension of a topic distribution equals the number of topics
simmat = similarities.MatrixSimilarity(all_topic_dists, num_features=num_topics)
# Compute the topic-distribution similarity to every other product
sim_scores = simmat[m_lda[raw_query.bow]]
ranked_indices = sorted(enumerate(sim_scores), key=lambda item: -item[1])
# Print up to 10 of the most similar products (the query product itself is skipped)
max_ = 10
for docidx_and_sim in ranked_indices[:max_]:
    product = df_reviews.loc[docidx_and_sim[0]]
    sim = docidx_and_sim[1]
    if product.asin != raw_query.asin:
        markdown_output(product, sim, printReview=True)

  1. Glutenfreeda Gluten Free Instant Oatmeal, Maple Raisin with Flax, 8-packet Box, 8 Pack
  2. McCANN’S Steel Cut Irish Oatmeal, 28-Ounce Tins (Pack of 4)
  3. FiberGourmet Light Spaghetti, 8-Ounce Boxes (Pack of 6)
  4. Kellogg’s Rice Krispies Gluten Free Cereal, Whole Grain Brown Rice, 12-Ounce Boxes (Pack of 4)
  5. Glutino Gluten Free Pantry Decadent Chocolate Cake Mix, 15-Ounce Boxes (Pack of 6)
  6. Quaker Oats Quick Oatmeal, 18-Ounce Packages (Pack of 6)
  7. Dreamfields Pasta Healthy Carb Living, Elbow Macaroni, 13.25-Ounce Boxes (Pack of 6)
  8. Barilla Plus Pasta, 14.5 Ounce
  9. Nature’s Path Organic Hot Cereal, Multigrain, 18 Ounce Canister (Pack of 6)
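
The ranking itself rests on cosine similarity between topic distributions, which is the measure Gensim's MatrixSimilarity computes. The following NumPy sketch shows the calculation on hypothetical distributions over three topics; the numbers are invented and unrelated to the trained model.

```python
import numpy as np

# Hypothetical topic distributions for four products over three topics.
topic_dists = np.array([
    [0.8, 0.1, 0.1],   # query product
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])

# Normalize rows to unit length so a dot product equals cosine similarity.
unit = topic_dists / np.linalg.norm(topic_dists, axis=1, keepdims=True)
sims = unit @ unit[0]

# Rank products by similarity to the query, most similar first.
ranking = np.argsort(-sims)
print(ranking.tolist())
print(np.round(sims, 3).tolist())
```

The query product always ranks first against itself, which is why the search code skips any result whose ASIN matches the query.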

 
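Looking at the highly similar products, we can see the following characteristics.

  • Products of the same type (gluten-free oatmeal)
  • Gluten-free products of a different type (chocolate cake, cereal)
  • Health food products (organic cereal)

Customers looking for gluten-free food would likely be interested in these products as well.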
 

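Summary

In this post, we looked at a way to use product review data to find products that are semantically closely related. In the example we examined, the similar products retrieved were in line with the criteria customers use when deciding what to buy.
Compared with a typical content-based recommender system, which measures similarity and makes recommendations based on product information registered by the seller, this approach seems able to produce more flexible recommendations.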
 
