Pizza Connections

IT Technology Blog

Menu Close

MovieLens recommendation engine with OrientDB

It's time to share...Share on FacebookTweet about this on TwitterShare on LinkedInShare on Google+Pin on Pinterest

The aim of this post is to create a movie recommendation engine through OrientDB using the same dataset used by Marko Rodriguez ( http://markorodriguez.com/2011/09/22/a-graph-based-movie-recommender-engine/ ).

First, let’s download the dataset from the grouplens website. For this specific task, i’ve downloaded the version with a million objects.

The dataset contains data about users and users occupation, movies and the ratings. With this dataset we builds a graph schema in OrientDB to create a recommendation engine.

 

Database creation

Run our OrientDb database:

Then create our MovieRatings database:

 

Now, let’s create and populate the Occupation class that stores data about users occupation:

This is the insert script:

insert into Occupation (id,description)
values
(0,'other'),
(1,'academic/educator'),
(2,'artist'),
(3,'clerical/admin'),
(4,'college/grad student'),
(5,'customer service'),
(6,'doctor/health care'),
(7,'executive/managerial'),
(8,'farmer'),
(9,'homemaker'),
(10,'K-12 student'),
(11,'lawyer'),
(12,'programmer'),
(13,'retired'),
(14,'sales/marketing'),
(15,'scientist'),
(16,'self-employed'),
(17,'technician/engineer'),
(18,'tradesman/craftsman'),
(19,'unemployed'),
(20,'writer')

Importing data with OrientDB ETL

We need to create this function, from Studio or launching this script into OrientDB console:

Create function split ["myString","separator"] "return myString.split(separator)"

Movies

The following Movies.json file for OrientDB ETL allows the creation of both Movies and Genres nodes together with the edges between them.

{
"config": {
"log": "info",
parallel: false
},
"source": { "file": { "path": "<your-path>movies.dat"} },
"extractor": { "row": {} },
"transformers": [
{ "csv": {"separator": "^",
"columnsOnFirstLine":false,
"columns":["id","title:string","genres"]}},
{ "field": { "fieldName": "genresArray", "expression": "split(genres,'|')"} },
{ "field": { "fieldName": "genresArray_0", "expression": "genresArray[0]"} },
{ "field": { "fieldName": "genresArray_1", "expression": "genresArray[1]"} },
{ "field": { "fieldName": "genresArray_2", "expression": "genresArray[2]"} },
{ "field": { "fieldName": "genresArray_3", "expression": "genresArray[3]"} },
{ "field": { "fieldName": "genresArray_4", "expression": "genresArray[4]"} },
{ "field": { "fieldName": "genresArray_5", "expression": "genresArray[5]"} },
{ "vertex": { "class": "Movies" } },
{ "edge": {
"class": "HasGenera",
"joinFieldName": "genresArray_0",
"lookup": "Genres.description",
"unresolvedLinkAction": "CREATE"
} },
{ "edge": {
"class": "HasGenera",
"joinFieldName": "genresArray_1",
"lookup": "Genres.description",
"unresolvedLinkAction": "CREATE"
} },
{ "edge": {
"class": "HasGenera",
"joinFieldName": "genresArray_2",
"lookup": "Genres.description",
"unresolvedLinkAction": "CREATE"
} },
{ "edge": {
"class": "HasGenera",
"joinFieldName": "genresArray_3",
"lookup": "Genres.description",
"unresolvedLinkAction": "CREATE"
} },
{ "edge": {
"class": "HasGenera",
"joinFieldName": "genresArray_4",
"lookup": "Genres.description",
"unresolvedLinkAction": "CREATE"
} },
{ "edge": {
"class": "HasGenera",
"joinFieldName": "genresArray_5",
"lookup": "Genres.description",
"unresolvedLinkAction": "CREATE"
} }
],
"loader": {
"orientdb": {
"dbURL": "remote:localhost/MovieRatings",
"dbType": "graph",
"standardElementConstraints": false,
"classes": [
{"name": "Movies", "extends": "V"},
{"name": "Genres", "extends": "V"},
{"name": "HasGenera", "extends": "E"}
], "indexes": [
{"class":"Movies", "fields":["id:integer"], "type":"UNIQUE" },
{"class":"Genres", "fields":["description:string"], "type":"UNIQUE_HASH_INDEX" }
]
}
}
}

 

Execute OrientDB ETL:

After that, launch this comand from Studio:

delete vertex Genres where description=false

At this point you should have the movie structure completed

 

Users

You can import the users.dat into Users class, executing this json in OrientDB ETL

{
    "config": {
        "log": "info",
        "parallel": false
    },
    "source": {
        "file": {
            "path": "<your-path>users.dat"
        }
    },
    "extractor": {
        "row": {

        }
    },
    "transformers": [{
        "csv": {
            "separator": "^",
            "columnsOnFirstLine": false,
            "columns": ["id",
            "gender:string",
            "age:integer",

            "occupationId:integer",

            "zipCode:string"]

        }

    },

    {

        "vertex": {

            "class": "Users"

        }

    },

    {

        "edge": {

            "class": "hasOccupation",

            "joinFieldName": "occupationId",

            "lookup": "Occupation.id",

            "unresolvedLinkAction": "CREATE"

        }

    }

    ],

    "loader": {

        "orientdb": {

            "dbURL": "remote:localhost/MovieRatings",

            "dbType": "graph",

            "standardElementConstraints": false,

            "classes": [{

                "name": "Users",

                "extends": "V"

            },

            {

                "name": "hasOccupation",

                "extends": "E"

            }],

            "indexes": [{

                "class": "Users",

                "fields": ["id:integer"],

                "type": "UNIQUE"

            }]

        }

    }

}

 

Ratings

 

At last, let’s execute the script to load the ratings edges:

{
    "config": {
        "log": "info",
#        "parallel": false
    },
    "source": {
        "file": {
            "path": "D:\\ETL\\ratings.dat"
        }
    },
    "extractor": {
        "row": {

        }
    },
    "transformers": [{
        "csv": {
            "separator": "^",
            "columnsOnFirstLine": false,
            "columns": ["userId:integer",
            "movieId:integer",
            "rating:integer",

            "ratingDate"]

        }

    },

    {

        "command": {

            "command": "create edge rated from (select from Users where id = ${input.userId}) to (select from Movies where id = ${input.movieId}) set rating = ${input.rating} , ratingDate = ${input.ratingDate}",

            "output": "edge"

        }

    }],

    "loader": {

        "orientdb": {

            "dbURL": "remote:localhost/MovieRatings",

            "dbType": "graph",

            "standardElementConstraints": false,

            "classes": [{

                "name": "rated",

                "extends": "E"

            }]

        }

    }

}

 

My database stores Users class into the cluster 16 and Movies into the cluster 13. Then By launching the query “SELECT FROM #16:0”, where #16:0 is Toy Story movie,  we can play with the graph.

Recommendation engine query

Let’s start with a simple query to obtain the ratings of the user #16.0:

select title, inE('rated')[out = #16:0].rating from (select expand(out('rated')) from #16:0)

Then take the 5 star ratings of the user #16:0

select expand(outE('rated')[rating = 5].in.title) from #16:0

Among the results, we want to take the one that has recieved more five stars between all the users:

select @rid, title, inE('rated')[rating=5].size() as 5stars
  from (select expand(outE('rated')[rating = 5].in) from #16:0)
 order by 5stars desc

 

Prepare for the recommendation by searching the users that gave the most 5 stars to the films to which the users #16:0 has given 5 stars:

select out.@rid as rid, out.id as id, count(*) as conto
  from (select expand(inE('rated')[rating=5])
          from (select expand(outE('rated')[rating = 5].in) from #16:0))
 where out <> #16:0
 group by out.@rid, out.id
 order by conto desc

Finally, let’s see see among the first 10 users similar to the user 16:0, which film has received more 5 stars and is still not present in the films rated by 16:0.

select title, count(*) as conto
  from (select expand(rid.outE('rated')[rating = 5].in)
          from (select @rid as rid, id as id, count(*) as conto
                  from (select expand(outE('rated')[rating=5].in.inE('rated'[rating=5].out) from #16:0)
                         where @rid <> #16:0 group by rid, id order by conto desc limit 10))
 where title not in (select out('rated').title from #16:0)
 group by title
 order by conto desc

The following query allows a recommendation by genre

select title, count(*) 
  from (select expand(rid.in().inE('rated')[rating = 5].in)
          from (select @rid, description, count(*)
                  from (select expand(in.out('hasGenera'))
                          from (select expand(outE()) from #16:481)
                         where rating > 3)
                 group by @rid, description order by count desc limit 1))
  where title not in (select out('rated').title from #16:481)
  group by title order by count desc

It unravels the genre of interest for the user #16:0, later on it recovers the films of that genre that scored more than 5 stars among all the users

This last query start from the movie Toy Story (# 13: 0)  and let the recommended movie by those who love Toy Story. That is the film that has collected more than five stars among those who gave 5 stars to Toy Story.

select title, count(*)
  from (select expand(inE('rated')[rating = 5].out.outE('rated')[rating = 5].in) from #13:0)
 where @rid not in [#13:0]
 group by title
 order by count desc
 limit 10

 

Conclusions

In this post we saw how to easily import from csv in OrientDB through OrientDB ETL. This tool allows the creation of our graph using simple json configuration files without a programming language.Therefore the recommendation engine is built with the sql API of OrientDB.Although on one hand the SQL language brings to the use of a lot of nested queries that might result unreadable ( I’m reserving to review all the queries to improve the readability and prestations anyawy) on the other extended SQL allows an approach more immediate to the database. In most cases, in fact, you can write easily queries after having apprehended the basilar functions to move on the database vertexes. Response times of queries are excellent considering the high number of relations that are necessary to get through. This would have put to test any other RDBMS.

It's time to share...Share on FacebookTweet about this on TwitterShare on LinkedInShare on Google+Pin on Pinterest

© 2017 Pizza Connections. All rights reserved.

Theme by Anders Norén.