Course: Big Data - IU S25
Author: Firas Jolha
MongoDB is an open-source NoSQL document database which stores data in the form of key-value pairs. It provides high performance and scalability along with data modelling and data management of huge data sets in enterprise applications.
MongoDB also provides auto-scaling, and it is cross-platform: it can be installed on different operating systems such as Windows and Linux.
Here, we need to install four tools for MongoDB: the MongoDB server, the MongoDB shell, the MongoDB command-line database tools, and MongoDB Compass.
The MongoDB server has community and enterprise editions. We are interested here in installing the community edition.
You can install MongoDB for your operating system on your local machine (without virtualization or containerization) by following the instructions in the official tutorial from here.
As of February 2025, the official website of MongoDB is blocked, but you can use a proxy server to access it.
You can use this option if you do not have a proxy server to download the software from the official website.
You can also install the MongoDB server using the docker image hosted here. You can follow the official tutorial from here but below I will present the main steps in spinning up a MongoDB server instance using Docker.
docker pull mongodb/mongodb-community-server:latest
docker run --name mongodb_server -p 27017:27017 -d mongodb/mongodb-community-server:latest
The command above runs a MongoDB server instance (latest version) in detached mode and publishes port 27017.
Now the server is running, but you are not yet able to run commands or execute MongoDB statements since the server does not deploy a web UI, so you need to install the next tools. You can, however, open http://localhost:27017 in a browser to verify that the server is working; MongoDB replies with a plain-text message noting that you are trying to access it over HTTP on the native driver port.
mongosh
The MongoDB Shell (mongosh) is not installed with the MongoDB Server. You just need to download mongosh and add it to your system PATH. You can find the official links from here. Here I will also share direct links to the latest version 2.3.8 for common operating systems.
If the links below do not work, please let me know. If you have issues downloading these files from the official website, also let me know.
If you did not find your operating system here, please go to the official website.
mongosh is the new name given to the MongoDB shell in recent versions, whereas older versions use the name mongo.
MongoDB Shell is an open-source (Apache 2.0), standalone product developed separately from the MongoDB Server.
This is another software developed to easily explore and manipulate MongoDB databases using a graphical interface. Compass provides detailed schema visualizations, real-time performance metrics, sophisticated querying abilities, and much more.
You can download it from the official website, and I share here some of the direct links to this brilliant software for the latest stable version 1.45.2.
MongoDB Compass comes with a built-in shell for running MongoDB commands, but you can also use the mongosh that you downloaded in the previous section.
The MongoDB Database Tools are a collection of command-line utilities for working with a MongoDB deployment. These tools release independently from the MongoDB Server, so you need to install them separately. The latest version of this package of tools is 100.11.0. You can find the official download website here, but I share the direct links below.
You can find more options for operating systems in the official website.
This package comes with several tools for database management, such as mongoimport for importing files into MongoDB databases and mongodump for dumping a database.
You can install MongoDB on the cluster node online via yum (see the tutorial for more info) or offline via wget and rpm. I recommend the offline installation to avoid any issues with yum. You need to download the packages mongodb-org, mongodb-org-mongos, mongodb-org-server, mongodb-org-shell, and mongodb-org-tools for version 4.2.9. I saved the links for you and put them below. You just need to copy the following command line, paste it into the terminal of the cluster node, and press Enter.
wget https://repo.mongodb.org/yum/redhat/7/mongodb-org/4.2/x86_64/RPMS/mongodb-org-4.2.9-1.el7.x86_64.rpm https://repo.mongodb.org/yum/redhat/7/mongodb-org/4.2/x86_64/RPMS/mongodb-org-mongos-4.2.9-1.el7.x86_64.rpm https://repo.mongodb.org/yum/redhat/7/mongodb-org/4.2/x86_64/RPMS/mongodb-org-server-4.2.9-1.el7.x86_64.rpm https://repo.mongodb.org/yum/redhat/7/mongodb-org/4.2/x86_64/RPMS/mongodb-org-shell-4.2.9-1.el7.x86_64.rpm https://repo.mongodb.org/yum/redhat/7/mongodb-org/4.2/x86_64/RPMS/mongodb-org-tools-4.2.9-1.el7.x86_64.rpm
After downloading the packages, you need to install them via rpm as follows:
rpm -i --force mongodb-org-4.2.9-1.el7.x86_64.rpm mongodb-org-server-4.2.9-1.el7.x86_64.rpm mongodb-org-shell-4.2.9-1.el7.x86_64.rpm mongodb-org-mongos-4.2.9-1.el7.x86_64.rpm mongodb-org-tools-4.2.9-1.el7.x86_64.rpm
If you try installing more recent versions of MongoDB, you may get errors. For instance, version 4.4 has some conflicting dependencies and is not recommended, whereas version 5.0 requires a CPU with AVX support, and that instruction set must also be available inside the Docker container.
There is also another way to install MongoDB on the cluster node which is via the open source tool published in the repository as follows.
git clone https://github.com/nikunjness/mongo-ambari.git
This open-source tool is quite old and installs MongoDB version 3.2, but it adds MongoDB as a service to the Ambari dashboard, where management and configuration can be easier.
MongoDB comes with several tools for database management, such as mongo or mongosh for running the MongoDB Shell, mongoimport for importing files into MongoDB databases, and mongodump for dumping a database.
You can work with MongoDB databases directly using the mongo/mongosh shell, but MongoDB also provides official driver libraries for multiple programming languages including Python, where performing operations on the databases is more flexible and easier, since you are familiar with Python and can utilize other packages for data manipulation such as pandas.
We can install the official driver/package pymongo from PyPI as follows:
pip3 install pymongo
You can check the documentation of this tool from here.
In the HDP sandbox, you can install version 3.13.0 via pip2 as follows:
pip2 install pymongo
pip2 will install the latest version of the library for Python 2 (3.13.0), whereas the latest version for Python 3 is 4.11.
Note: If you added a Python interpreter to Zeppelin, then you can immediately create a notebook and start working with pymongo.
In MongoDB, databases hold collections of documents. Collections are analogous to tables in relational databases, and documents to records. If a database does not exist, MongoDB creates it when you first store data for it. In the same way, if a collection does not exist, MongoDB creates it when you first store data for it.
A collection in MongoDB
A document in MongoDB is a set of key-value pairs. Documents have a dynamic schema, which means that documents in the same collection do not need to possess the same set of fields.
Since MongoDB is considered a schema-less database, each collection can hold different types of objects. Every object in a collection is known as a document, represented in a JSON-like (JavaScript Object Notation) structure. Data is stored and queried in BSON, a binary representation of JSON documents that supports more data types than JSON. MongoDB documents are composed of field-and-value pairs. The value of a field can be any of the BSON data types, including other documents, arrays, and arrays of documents.
A document in MongoDB
null in MongoDB explicitly signals that a key exists but its value is intentionally set to 'nothing' or 'empty'. null is used widely across MongoDB documents for various reasons. For example:
db.myCollection.insert({ property: null })
The following document contains 6 fields: _id of type ObjectId; name, which holds an embedded document containing the fields first and last; birth and death, which hold values of the Date type; contribs, which holds an array of strings; and views, which holds a value of the NumberLong type.
var mydoc = {
_id: ObjectId("5099803df3f4948bd2f98391"),
name: { first: "Alan", last: "Turing" },
birth: new Date('Jun 23, 1912'),
death: new Date('Jun 07, 1954'),
contribs: [ "Turing machine", "Turing test", "Turingery" ],
views : 1250000
}
Field names are strings. The field name _id
is reserved for use as a primary key; its value must be unique in the collection, is immutable, and may be of any type other than an array, it is a combination of machine identifier, timestamp and process id to keep it unique, but user can changes it to anything. Field names cannot contain the null character. MongoDB uses the dot notation to access the elements of an array and to access the fields of an embedded document.
Query filter documents are documents used, notably in the db.collection.find method (equivalent to SELECT in SQL), to specify the conditions that determine which records to select for read, update, and delete operations. You can use <field>:<value> expressions to specify equality conditions and query operator expressions.
{
<field1>: <value1>,
<field2>: { <operator>: <value> },
...
}
MongoDB uses the dot notation to access the elements of an array and to access the fields of an embedded document. For example, you access the field views of the previous document as mydoc.views. To specify or access an element of an array by its zero-based index position, concatenate the array name with the dot (.) and the zero-based index position, and enclose the whole in quotes:
"<array>.<index>"
To specify the third element in the contribs array, use the dot notation "contribs.2".
To specify or access a field of an embedded document with dot notation, concatenate the embedded document name with the dot (.) and the field name, and enclose in quotes:
"<embedded document>.<field>"
For example, given the following field in a document:
{
...
name: { first: "Alan", last: "Turing" },
contact: { phone: { type: "cell", number: "111-222-3333" } },
...
}
To specify the number in the phone document in the contact field, use the dot notation "contact.phone.number".
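The dot-notation rules above can be sketched in plain JavaScript. `getPath` below is a hypothetical helper (not a MongoDB API) that resolves a dotted path against a document, treating numeric segments as array indexes:

```javascript
// Hypothetical helper: resolve a MongoDB-style dotted path against a plain object.
function getPath(doc, path) {
  return path.split(".").reduce((value, key) => {
    if (value === undefined || value === null) return undefined;
    return value[key]; // numeric strings index arrays, names index objects
  }, doc);
}

const doc = {
  name: { first: "Alan", last: "Turing" },
  contribs: ["Turing machine", "Turing test", "Turingery"],
  contact: { phone: { type: "cell", number: "111-222-3333" } }
};

const phone = getPath(doc, "contact.phone.number"); // "111-222-3333"
const third = getPath(doc, "contribs.2");           // "Turingery"
```

This is the same traversal MongoDB performs internally when you query with a quoted dotted path.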
The maximum BSON document size is 16 megabytes. To store documents larger than that, MongoDB provides the GridFS API, which divides the large file into parts (chunks) and stores each chunk as a separate document.
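The chunking idea behind GridFS can be sketched as follows. GridFS's default chunk size is 255 kB; this standalone snippet only shows the split-and-reassemble logic on a dummy buffer, not the real GridFS API (though files_id, n, and data are the actual fields of GridFS chunk documents):

```javascript
// Sketch of GridFS-style chunking: split a buffer into fixed-size chunks
// (GridFS default is 255 kB) and store each chunk as its own "document".
const CHUNK_SIZE = 255 * 1024;

function toChunks(fileId, data, chunkSize = CHUNK_SIZE) {
  const chunks = [];
  for (let n = 0; n * chunkSize < data.length; n++) {
    chunks.push({
      files_id: fileId, // reference to the file's metadata document
      n,                // chunk sequence number
      data: data.slice(n * chunkSize, (n + 1) * chunkSize)
    });
  }
  return chunks;
}

function reassemble(chunks) {
  return Buffer.concat(chunks.sort((a, b) => a.n - b.n).map(c => c.data));
}

const file = Buffer.alloc(600 * 1024, 7); // 600 kB dummy file
const chunks = toChunks("myfile", file);  // 600/255 -> 3 chunks
```

Because each chunk is an ordinary document under the 16 MB limit, files of any size can be stored and streamed back in order.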
The fields in a BSON document are ordered. For example, {a: 1, b: 1} is not equal to {b: 1, a: 1}.
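This ordering rule can be observed even outside MongoDB: JavaScript objects keep insertion order, so serializing the two documents shows they differ even though they hold the same pairs. A small standalone check:

```javascript
// Field order matters for BSON equality: {a:1,b:1} and {b:1,a:1}
// carry the same pairs but in a different order.
const d1 = { a: 1, b: 1 };
const d2 = { b: 1, a: 1 };

// Same keys and values, ignoring order:
const samePairs = Object.keys(d1).length === Object.keys(d2).length &&
  Object.keys(d1).every(k => d1[k] === d2[k]);               // true
// Serialized forms preserve insertion order, so they differ:
const sameOrder = JSON.stringify(d1) === JSON.stringify(d2); // false
```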
The sample_mflix dataset contains data on movies and movie theaters, as well as certain metadata, including users and comments on specific movies. In this lab, we will work on the sample_mflix dataset from the MongoDB website.
A fake image of Mflix brand
The dataset consists of 5 .json files, where each file will be loaded to create a collection and each line corresponds to a document in the collection. These files are dumped from a MongoDB database and the data is stored as Extended JSON v2.0. This format was introduced in MongoDB to preserve the datatypes and attributes of certain fields when converting from BSON to JSON. There is also Extended JSON v1.0, which is used in MongoDB versions before 4.2 and supports fewer datatypes than v2.0. You can read more about MongoDB Extended JSON v2.0 here. You can see Extended JSON v2.0 in the files by checking the fields which start with the symbol $. For example:
{"theaterId":{"$numberInt":"1000"}}
theaterId here is a field which has the value 1000 as an integer type.
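To make the Extended JSON v2.0 convention concrete, here is a hedged sketch of a decoder for two of its type wrappers ($numberInt and $numberLong). In practice the conversion is done by mongoimport or the driver, and the v2.0 spec defines many more wrappers than these two:

```javascript
// Minimal Extended JSON v2.0 decoder sketch: unwraps $numberInt/$numberLong
// wrappers into plain numbers. mongoimport/drivers handle the full spec.
function decodeExtJSON(value) {
  if (Array.isArray(value)) return value.map(decodeExtJSON);
  if (value && typeof value === "object") {
    const keys = Object.keys(value);
    if (keys.length === 1 && (keys[0] === "$numberInt" || keys[0] === "$numberLong")) {
      return Number(value[keys[0]]); // wrapper holds the number as a string
    }
    const out = {};
    for (const k of keys) out[k] = decodeExtJSON(value[k]);
    return out;
  }
  return value;
}

const raw = JSON.parse('{"theaterId":{"$numberInt":"1000"}}');
const doc = decodeExtJSON(raw); // { theaterId: 1000 }
```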
When you use mongoimport v4.2 for loading the data into MongoDB, the datatypes in the Extended JSON format will be automatically translated into the corresponding types in MongoDB. If you are using an older version of mongoimport, some datatypes will not be translated properly, since the older versions support only Extended JSON v1.0.
The description of the dataset is presented in the following table.
File Name | Description |
---|---|
users.json | This collection contains information on mflix users. Each line/document contains a single user, and their name, email, and password. The fields are _id as the primary key in the collection, name as name of the user, email has the unique property, and password field which contains the user password. |
movies.json | This collection contains details on movies. Each document contains a single movie, and information such as its title, release year, and cast. |
theaters.json | This collection contains movie theater locations. Each document contains a single movie theater and its location in both string and GeoJSON forms. |
sessions.json | This collection contains metadata about users. Each document contains a user and their corresponding JSON Web Token. |
comments.json | This collection contains comments associated with specific movies. Each document contains the comment text, the user who submitted it, and the movie the comment applies to. |
For more information on the dataset, you can look here.
You can build a document store by inserting some data into it. In this lab, we will use the sample dataset from the official website: sample_mflix. It consists of 5 JSON files.
Exercises on MongoDB
1. Download the dataset sample_mflix. Unzip it and copy the files to the folder /sample_mflix in the cluster node via scp or docker cp.
2. Import the data into the database mflixdb via the mongoimport tool (Note: you do not need to create a database or a collection in advance). For example, to import comments data from comments.json into the database, you can run the following command. Note: for running a multiline command in Windows PowerShell, replace the backslash (\) with the backtick (`).
mongoimport --db mflixdb \
--collection comments \
--file comments.json
Note: The default port for the previous connection is 27017. If your server is running on a different host or port, you can specify them via the options --host and --port.
Similarly, import the data from other files. Eventually, you will create 5 collections in the database which are [“comments”, “sessions”, “movies”, “theaters”, “users”].
mongoimport --db mflixdb --collection sessions --file sessions.json
mongoimport --db mflixdb --collection movies --file movies.json
mongoimport --db mflixdb --collection theaters --file theaters.json
mongoimport --db mflixdb --collection users --file users.json
Note: If the JSON file includes an array as the root element, then you need to add the option --jsonArray. The option --headerline is useful for accepting the first line of the file as the field list in csv and tsv files. You can find more options for the mongoimport tool by running mongoimport --help.
Next, access the database from mongo/mongosh or via pymongo. To access the database mflixdb from mongo, you can write in the shell as follows:
> use mflixdb
switched to db mflixdb
In the mongo/mongosh shell, db is the variable that references the current database. It is automatically set to the default database test when you access the shell, and is set again when you use use <db> to switch the current database.
> db.getCollectionNames()
> db.theaters.find().limit(10)
OR
> db['theaters'].find().limit(10)
CRUD operations create, read, update, and delete documents. All examples here are given for the mongo/mongosh shell. Adapting them to another programming language such as Python is not difficult in case you decide to use the PyMongo package.
The apparent difference between the mongo shell and pymongo is that pymongo uses snake_case for method names, such as db.collection.insert_many(), whereas the mongo shell uses camelCase, such as db.collection.insertMany().
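The naming difference is mechanical, as this small sketch shows; `toSnakeCase` is a hypothetical helper, not part of either driver:

```javascript
// Hypothetical helper: map mongosh-style camelCase method names to the
// snake_case names PyMongo uses.
function toSnakeCase(name) {
  return name.replace(/([A-Z])/g, (_, c) => "_" + c.toLowerCase());
}

const pymongoName = toSnakeCase("insertMany");   // "insert_many"
const another = toSnakeCase("findOneAndUpdate"); // "find_one_and_update"
```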
If a database does not exist, MongoDB creates the database when you first store data for that database. If a database already exists with the mentioned name, then it just connects to that database.
use <database_name>
Before deleting the database, connect to the required database which is to be deleted. The default database when you access the shell is test
.
db.dropDatabase()
You can check the databases in the current DBMS instance as follows (by default, MongoDB has access control disabled, so there is no default user or password):
show databases
You can check the collections in the connected database as follows:
show collections
// db.createCollection(name, options)
db.createCollection("student", { capped : true, size : 5242880, max : 5000 } )
This will create a capped collection named student, with a maximum size of 5 megabytes and a maximum of 5000 documents.
If a capped collection reaches its maximum size limit, MongoDB removes older documents from it to make space for new ones.
// db.collection_name.drop()
db.student.drop()
The drop method returns true if the collection is dropped successfully; otherwise it returns false.
MongoDB provides the following methods to insert documents into a collection:
To insert a single document, use db.collection.insertOne(), for example to insert a single document into the users collection.
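A hedged sketch of the insertOne call shape: in a real mongosh session only the insertOne call is needed; the in-memory db stand-in below exists just so the snippet runs without a server, and the field values are made up.

```javascript
// Stand-in for a mongosh session: a tiny in-memory "users" collection so the
// insertOne call shape can be exercised without a running server.
const db = {
  users: {
    docs: [],
    insertOne(doc) {
      this.docs.push(doc);
      return { acknowledged: true, insertedId: doc._id ?? "generated-by-driver" };
    }
  }
};

// The call shape matches mongosh: db.users.insertOne({ ... })
const result = db.users.insertOne({
  name: "Ada Lovelace",            // hypothetical user
  email: "ada@example.com",
  password: "not-a-real-password"
});
```

In a real deployment the driver also generates an ObjectId for _id when the document omits it.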
To insert many documents:
db.inventory.insertMany([
{ item: "journal", qty: 25, tags: ["blank", "red"], size: { h: 14, w: 21, uom: "cm" } },
{ item: "mat", qty: 85, tags: ["gray"], size: { h: 27.9, w: 35.5, uom: "cm" } },
{ item: "mousepad", qty: 25, tags: ["gel", "blue"], size: { h: 19, w: 22.85, uom: "cm" } }
])
In MongoDB, each document stored in a collection requires a unique _id field that acts as a primary key. If an inserted document omits the _id field, the MongoDB driver automatically generates an ObjectId for it.
You can query documents in MongoDB via the method db.collection.find(query, projection). Given the following inventory collection:
db.inventory.insertMany([
{ item: "journal", qty: 25, size: { h: 14, w: 21, uom: "cm" }, status: "A" },
{ item: "notebook", qty: 50, size: { h: 8.5, w: 11, uom: "in" }, status: "A" },
{ item: "paper", qty: 100, size: { h: 8.5, w: 11, uom: "in" }, status: "D" },
{ item: "planner", qty: 75, size: { h: 22.85, w: 30, uom: "cm" }, status: "D" },
{ item: "postcard", qty: 45, size: { h: 10, w: 15.25, uom: "cm" }, status: "A" }
]);
Let’s run the following queries.
To select all documents in the collection, pass an empty document as the query filter parameter to the find method. The query filter parameter determines the select criteria. For example, to select all inventory data without any specific criteria:
db.inventory.find( {} )
The equivalent SQL statement is:
SELECT * FROM inventory
To specify equality conditions, use <field>:<value> expressions in the query filter document in the following form:
{ <field1>: <value1>, ... }
For example, to select all inventory whose status is “D”:
db.inventory.find( { status : "D" } )
The equivalent SQL statement is:
SELECT * FROM inventory WHERE status = "D"
MongoDB provides several query operators to specify conditions on the retrieved data in the following form:
{ <field1>: { <operator1>: <value1> }, ... }
For example, to retrieve all documents from the inventory collection where status equals either “A” or “D”:
db.inventory.find( { status: { $in: [ "A", "D" ] } } )
Equivalently in SQL.
SELECT * FROM inventory WHERE status in ("A", "D")
A compound query can specify conditions for more than one field in the collection's documents. Implicitly, a logical AND conjunction connects the clauses of a compound query, so the query selects the documents in the collection that match all the conditions. For example, to retrieve all documents in the inventory collection where the status equals "A" and qty is less than ($lt) 30:
db.inventory.find( { status: "A", qty: { $lt: 30 } } )
The operation corresponds to the following SQL statement:
SELECT * FROM inventory WHERE status = "A" AND qty < 30
Using the $or operator, you can specify a compound query that joins each clause with a logical OR conjunction, so the query selects the documents in the collection that match at least one condition. The following example retrieves all documents in the collection where the status equals "A" or qty is less than ($lt) 30:
db.inventory.find( { $or: [ { status: "A" }, { qty: { $lt: 30 } } ] } )
The operation corresponds to the following SQL statement:
SELECT * FROM inventory WHERE status = "A" OR qty < 30
In the following example, the compound query document selects all documents in the collection where the status equals “A” and either qty is less than ($lt) 30 or item starts with the character p:
db.inventory.find( {
status: "A",
$or: [ { qty: { $lt: 30 } }, { item: /^p/ } ]
} )
The operation corresponds to the following SQL statement:
SELECT * FROM inventory WHERE status = "A" AND ( qty < 30 OR item LIKE "p%")
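To see how these filters combine, here is a hedged standalone sketch of a matcher supporting equality, $lt, $in, $or, and regex values. It imitates only the small fragment of MongoDB's query language used in the examples above, evaluated over the same inventory documents:

```javascript
// Miniature query-filter matcher covering the operators used above:
// equality, $lt, $in, regex values, implicit AND, and $or.
function matches(doc, filter) {
  return Object.entries(filter).every(([key, cond]) => {
    if (key === "$or") return cond.some(f => matches(doc, f));
    const value = doc[key];
    if (cond instanceof RegExp) return cond.test(value);
    if (cond && typeof cond === "object") {
      return Object.entries(cond).every(([op, arg]) => {
        if (op === "$lt") return value < arg;
        if (op === "$in") return arg.includes(value);
        throw new Error("unsupported operator: " + op);
      });
    }
    return value === cond; // plain equality condition
  });
}

const inventory = [
  { item: "journal", qty: 25, status: "A" },
  { item: "notebook", qty: 50, status: "A" },
  { item: "paper", qty: 100, status: "D" },
  { item: "planner", qty: 75, status: "D" },
  { item: "postcard", qty: 45, status: "A" }
];

// status = "A" AND (qty < 30 OR item LIKE "p%")
const filter = { status: "A", $or: [{ qty: { $lt: 30 } }, { item: /^p/ }] };
const hits = inventory.filter(d => matches(d, filter)).map(d => d.item);
// hits -> ["journal", "postcard"]
```

Note how top-level keys are ANDed implicitly while $or short-circuits on the first matching clause, mirroring the semantics described above.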
The db.collection.findOne() method performs a read operation that returns a single document. Internally, db.collection.findOne() is the db.collection.find() method with a limit of 1.
To specify an equality condition on a field that is an embedded/nested document, use the query filter document { <field>: <value> } where <value> is the document to match.
For example, the following query selects all documents where the field size equals the document { h: 14, w: 21, uom: "cm" }:
db.inventory.find( { size: { h: 14, w: 21, uom: "cm" } } )
Equality matches on the whole embedded document require an exact match of the specified <value> document, including the field order. For example, the following query does not match any documents in the inventory collection:
db.inventory.find( { size: { w: 21, h: 14, uom: "cm" } } )
To specify a query condition on fields in an embedded/nested document, use dot notation ("field.nestedField").
When querying using dot notation, the field and nested field must be inside “quotation marks”.
The following example selects all documents where the field uom nested in the size field equals "in":
db.inventory.find( { "size.uom": "in" } )
Or maybe you would like to search using query operators to specify conditions. For example, to retrieve the inventory items whose height is less than 15:
db.inventory.find( { "size.h": { $lt: 15 } } )
Exercises on MongoDB
// db.theaters.find(filter, projection)
db.theaters.find({})
db.theaters.find()
// wrong: the field path below is missing the $ prefix, so $expr compares a literal string
db.theaters.find(
{
"location.address.state": {
$regex : "A$"
},
$expr: {
$gt: ["location.address.zipcode", 90000]
}
},
{
location : true,
theaterId : true,
_id : false
}
).limit(3)
// wrong: zipcode is stored as a string, so comparing it with the number 90000 compares across BSON types
db.theaters.find(
{
"location.address.state": {
$regex : "A$"
},
$expr: {
$gt: ["$location.address.zipcode", 90000]
}
},
{
location : true,
theaterId : true,
_id : false
}
).limit(3)
// string comparison solution
db.theaters.find(
{
"location.address.state": {
$regex : "A$"
},
$expr: {
$gt: ["$location.address.zipcode", "90000"]
}
},
{
location : true,
theaterId : true,
_id : false
}
).limit(3)
// using the operator $toInt
// This may give errors: the field path is missing the $ prefix, so $toInt receives a literal string
db.theaters.find(
{
"location.address.state": {
$regex : "A$"
},
$expr: {
$gt: [{$toInt: "location.address.zipcode"}, 90000]
}
},
{
location : true,
theaterId : true,
_id : false
}
).count()
// using $convert
db.theaters.find(
{
"location.address.state": {
$regex : "A$"
},
$expr: {
$gt: [
{
$convert: {
input: "$location.address.zipcode",
to: "int",
onError: {
$concat: [
"Could not convert ",
{$toString: "$location.address.zipcode" },
" to type integer."
]
},
onNull: Int32("0")
}
},
90000
]
}
},
{
location : true,
theaterId : true,
_id : false
}
).count()
db.movies.find(
{
genres: {
$in:["Drama", "Crime"]
},
num_mflix_comments: {
$gte: 100
},
countries: {
$nin: ["USA"]
},
type: {
$eq:"movie"
}
},
{countries:1}
)
db.movies.find(
{
$or: [
{
genres: {
$in: ["Drama"]
}
},
{
$and : [
{
num_mflix_comments: {
$gte: 100
}
},
                    {
                        countries:
                        {
                            $not: { $in: ["USA"] }
                        }
                    }
]
}
]
},
{_id:false, awards:false}
).limit(3)
Retrieve the theaters where the field location.address.street2 exists and the first element of location.geo.coordinates is of type number. Retrieve the number of theaters which satisfy this condition.
db.theaters.find(
{
"location.address.street2":{
$exists:true
},
"location.geo.coordinates.0":{
$type: "number"
}
}
).count()
Note: You can get the datatype of a specific field in the MongoDB Shell by running typeof <object>. For example:
typeof db.theaters.findOne().location.geo.coordinates[0]
We have 6 operators here; the most commonly used ones are $expr, $text, and $regex.
$expr allows the use of aggregation expressions within the query language. It has the following syntax:
{ $expr: { <expression> } }
The argument can be any valid aggregation expression.
Example:
db.theaters.find(
{
"location.address.zipcode":{
$exists:true
},
"location.address.state": {
$regex : /^A/
},
"$expr": {
"$gt":
[
{
"$convert":{
input: "$location.address.zipcode",
to: "decimal", onError:"", onNull:""
}
},
90000
]
}
},
{_id:false}
).count()
Note: For the operator $text, you need to create a text index first:
db.movies.createIndex(
{
title: "text",
plot: "text"
}
)
db.movies.find(
{
$text:{
$search:"\"card game\" -money -friend -criminal"
}
},
{
score:{
$meta:"textScore"
},
title:true,
plot: true
}
)
You can visit this tutorial for checking the score calculation formula.
Array operators return data based on array conditions. The $all operator selects the documents where the value of a field is an array that contains all the specified elements.
db.movies.find(
{
genres:{
$all:['Action', 'Comedy', 'Drama']
},
countries: {
$elemMatch:{
$eq:"Canada"
}
},
$expr: {
$gt : [
{$size: "$countries"},
1
]
}
}
)
Note: I covered here some of the operators, and you can check the documentation for the full list of them.
Aggregation operations process data records and return computed results. Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result.
MongoDB’s aggregation pipeline is a framework for data aggregation modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms them into aggregated results. The pipeline consists of stages, and each stage transforms the documents as they pass through. Pipeline stages do not need to produce one output document for every input document; e.g., some stages may generate new documents or filter out documents. MongoDB provides the db.collection.aggregate() method to run the aggregation pipeline.
For example, the following query returns the total amount for each customer only for orders whose status is A.
db.orders.aggregate([
{ $match: { status: "A" } },
{ $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
])
- First stage: the $match stage filters the documents by the status field and passes to the next stage those documents that have status equal to "A".
- Second stage: the $group stage groups the documents by the cust_id field to calculate the sum of the amount for each unique cust_id.
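The two stages can be simulated in plain JavaScript to see the data flow; the order documents below are invented for illustration:

```javascript
// Plain-JS simulation of the $match -> $group pipeline above, using
// invented order documents.
const orders = [
  { cust_id: "A123", amount: 500, status: "A" },
  { cust_id: "A123", amount: 250, status: "A" },
  { cust_id: "B212", amount: 200, status: "A" },
  { cust_id: "A123", amount: 300, status: "D" } // filtered out by $match
];

// Stage 1 ($match): keep only documents with status "A".
const matched = orders.filter(o => o.status === "A");

// Stage 2 ($group): sum amount per cust_id.
const totals = {};
for (const o of matched) {
  totals[o.cust_id] = (totals[o.cust_id] || 0) + o.amount;
}
// totals -> { A123: 750, B212: 200 }
```

Each stage consumes the previous stage's output, which is exactly how the server streams documents through the pipeline.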
Given the following 20 documents of the zipcodes collection:
db.zipcodes.insertMany([
{ "_id" : "01001", "city" : "AGAWAM", "loc" : [ -72.622739, 42.070206 ], "pop" : 15338, "state" : "MA" },
{ "_id" : "01011", "city" : "CHESTER", "loc" : [ -72.988761, 42.279421 ], "pop" : 1688, "state" : "MA" },
{ "_id" : "01012", "city" : "CHESTERFIELD", "loc" : [ -72.833309, 42.38167 ], "pop" : 177, "state" : "MA" },
{ "_id" : "01013", "city" : "CHICOPEE", "loc" : [ -72.607962, 42.162046 ], "pop" : 23396, "state" : "MA" },
{ "_id" : "01020", "city" : "CHICOPEE", "loc" : [ -72.576142, 42.176443 ], "pop" : 31495, "state" : "MA" },
{ "_id" : "01022", "city" : "WESTOVER AFB", "loc" : [ -72.558657, 42.196672 ], "pop" : 1764, "state" : "MA" },
{ "_id" : "01026", "city" : "CUMMINGTON", "loc" : [ -72.905767, 42.435296 ], "pop" : 1484, "state" : "MA" },
{ "_id" : "01057", "city" : "MONSON", "loc" : [ -72.31963399999999, 42.101017 ], "pop" : 8194, "state" : "MA" },
{ "_id" : "01060", "city" : "FLORENCE", "loc" : [ -72.654245, 42.324662 ], "pop" : 27939, "state" : "MA" },
{ "_id" : "01103", "city" : "SPRINGFIELD", "loc" : [ -72.588735, 42.1029 ], "pop" : 2323, "state" : "MA" },
{ "_id" : "01104", "city" : "SPRINGFIELD", "loc" : [ -72.577769, 42.128848 ], "pop" : 22115, "state" : "MA" },
{ "_id" : "01105", "city" : "SPRINGFIELD", "loc" : [ -72.578312, 42.099931 ], "pop" : 14970, "state" : "MA" },
{ "_id" : "01106", "city" : "LONGMEADOW", "loc" : [ -72.5676, 42.050658 ], "pop" : 15688, "state" : "MA" },
{ "_id" : "01107", "city" : "SPRINGFIELD", "loc" : [ -72.606544, 42.117907 ], "pop" : 12739, "state" : "MA" },
{ "_id" : "01108", "city" : "SPRINGFIELD", "loc" : [ -72.558432, 42.085314 ], "pop" : 25519, "state" : "MA" },
{ "_id" : "01109", "city" : "SPRINGFIELD", "loc" : [ -72.554349, 42.114455 ], "pop" : 32635, "state" : "MA" },
{ "_id" : "01118", "city" : "SPRINGFIELD", "loc" : [ -72.527445, 42.092937 ], "pop" : 14618, "state" : "MA" },
{ "_id" : "01119", "city" : "SPRINGFIELD", "loc" : [ -72.51211000000001, 42.12473 ], "pop" : 13040, "state" : "MA" },
{ "_id" : "01128", "city" : "SPRINGFIELD", "loc" : [ -72.48890299999999, 42.094397 ], "pop" : 3272, "state" : "MA" },
{ "_id" : "01129", "city" : "SPRINGFIELD", "loc" : [ -72.487622, 42.122263 ], "pop" : 6831, "state" : "MA" }
]);
where each document has the following structure:
{
"_id": "10280",
"city": "NEW YORK",
"state": "NY",
"pop": 5574,
"loc": [
-74.016323,
40.710537
]
}
- The _id field holds the zip code as a string.
- The city field holds the city name. A city can have more than one zip code associated with it, as different sections of the city can each have a different zip code.
- The state field holds the two-letter state abbreviation.
- The pop field holds the population.
- The loc field holds the location as a longitude-latitude pair.

Let's run the following queries.
Exercises on MongoDB
db.zipcodes.aggregate( [
{ $group: { _id: "$state", totalPop: { $sum: "$pop" } } },
{ $match: { totalPop: { $gte: 10*1000*1000 } } }
] )
In this example, the aggregation pipeline consists of the $group stage followed by the $match stage:
- The $group stage groups the documents of the zipcodes collection by the state field, calculates the totalPop field for each state, and outputs a document for each unique state. The new per-state documents have two fields: the _id field and the totalPop field. The _id field contains the value of the state, i.e. the group-by field. The totalPop field is a calculated field that contains the total population of each state. To calculate the value, $group uses the $sum operator to add the population field (pop) for each state. After the $group stage, the documents in the pipeline resemble the following:
{
"_id" : "AK",
"totalPop" : 550043
}
- The $match stage filters these grouped documents to output only those documents whose totalPop value is greater than or equal to 10 million. The $match stage does not alter the matching documents; it outputs them unmodified.

The equivalent SQL for this aggregation operation is:
SELECT state, SUM(pop) AS totalPop
FROM zipcodes
GROUP BY state
HAVING totalPop >= (10*1000*1000)
Exercises on MongoDB
db.zipcodes.aggregate( [
{ $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },
{ $group: { _id: "$_id.state", avgCityPop: { $avg: "$pop" } } }
] )
In this example, the aggregation pipeline consists of a $group stage followed by another $group stage:
- The first $group stage groups the documents by the combination of state and city, and sums the pop values for each (state, city) pair. After this stage in the pipeline, the documents resemble the following form:
{
"_id" : {
"state" : "CO",
"city" : "EDGEWATER"
},
"pop" : 13154
}
- The second $group stage groups the documents in the pipeline by the _id.state field (i.e. the state field inside the _id document), uses the $avg expression to calculate the average city population (avgCityPop) for each state, and outputs a document for each state.

The documents that result from this aggregation operation resemble the following:
{
"_id" : "MN",
"avgCityPop" : 5335
}
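The two-stage $group logic can likewise be sketched in plain JavaScript on a tiny invented sample (two states, three cities) to show how city totals feed the state average:

```javascript
// Plain-JS sketch of the two $group stages: first sum pop per (state, city),
// then average those city totals per state. Sample data is invented.
const zipcodes = [
  { city: "SPRINGFIELD", state: "MA", pop: 2000 },
  { city: "SPRINGFIELD", state: "MA", pop: 4000 }, // second zip of the same city
  { city: "CHICOPEE", state: "MA", pop: 6000 },
  { city: "EDGEWATER", state: "CO", pop: 13154 }
];

// Stage 1: group by (state, city) and sum pop.
const cityPop = {};
for (const z of zipcodes) {
  const key = z.state + "|" + z.city;
  cityPop[key] = (cityPop[key] || 0) + z.pop;
}

// Stage 2: group by state and average the city populations.
const byState = {};
for (const [key, pop] of Object.entries(cityPop)) {
  const state = key.split("|")[0];
  (byState[state] = byState[state] || []).push(pop);
}
const avgCityPop = Object.fromEntries(
  Object.entries(byState).map(([s, pops]) =>
    [s, pops.reduce((a, b) => a + b, 0) / pops.length])
);
// avgCityPop -> { MA: 6000, CO: 13154 }
```

Note that the average is taken over city totals, not over raw zip-code documents, which is why the intermediate grouping stage is needed.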
In the db.collection.aggregate()
method, pipeline stages appear in an array. Documents pass through the stages in sequence. It has the following schema:
db.collection.aggregate( [ { <stage> }, ... ] )
All stages except $out and $merge can appear multiple times in the same pipeline.
The $group stage separates documents into groups according to a "group key". The output is one document for each unique group key.
{
$group:
{
_id: <expression>, // Group key
<field1>: { <accumulator1> : <expression1> },
...
}
}
The <accumulator> operator must be one of the supported accumulator operators, such as $sum, $avg, $min, $max, $push, or $addToSet.
<expressions> can include field paths, literals, system variables, expression objects, and expression operators, and expressions can be nested. A field path is the path to a field in the document. To specify a field path, use a string that prefixes the field name with a dollar sign ($): for example, "$user" specifies the field path for the user field, and "$user.name" specifies the field path to the "user.name" field. "$<field>" is equivalent to "$$CURRENT.<field>", where CURRENT is a system variable. Check here for a list of system variables.
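For illustration only, here is a rough Python sketch of how a dotted field path such as "$user.name" resolves against a document (real MongoDB expression evaluation is much richer; this function and its name are made up):

```python
def resolve_field_path(doc, path):
    """Resolve a MongoDB-style field path such as "$user.name"
    against a plain dict, walking one dotted segment at a time."""
    cur = doc
    for part in path.lstrip("$").split("."):
        if not isinstance(cur, dict) or part not in cur:
            return None  # a missing field resolves to "missing" in MongoDB
        cur = cur[part]
    return cur

doc = {"user": {"name": "ada", "age": 36}}
print(resolve_field_path(doc, "$user.name"))  # ada
```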
Check the documentation for other stages and pipeline operators.
Exercises on MongoDB
db.movies.aggregate(
[
{
$group: {
_id: "$type",
average: { $avg: "$num_mflix_comments" }
}
}
]
)
db.movies.aggregate(
[
{
$match:{
countries:{
$in:['China']
}
}
},
{
$group:{
_id: "$type",
average: {
$avg: "$num_mflix_comments"
}
}
},
{
$sort:{
average: 1,
}
}
]
)
db.movies.aggregate(
[
{
$match:
{
cast:
{
$ne:null
}
}
},
{
$addFields:
{
cast_size:
{
$size : "$cast"
}
}
},
{
$group:
{
_id: "$type",
cast_average:
{
$avg: "$cast_size"
}
}
},
{
$addFields:
{
cast_average_int:
{
$toInt:"$cast_average"
}
}
}
]
)
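The $match / $addFields / $size / $group / $toInt pipeline above can be sketched in plain Python (made-up sample documents; note that both Python's int() and MongoDB's $toInt truncate the decimal part):

```python
from collections import defaultdict

# Made-up movie documents with a cast array.
movies = [
    {"type": "movie", "cast": ["a", "b", "c"]},
    {"type": "movie", "cast": ["a", "b"]},
    {"type": "series", "cast": ["x"]},
    {"type": "movie", "cast": None},
]

# $match (cast not null) + $addFields/$size: cast sizes per type.
sizes = defaultdict(list)
for m in movies:
    if m.get("cast") is not None:
        sizes[m["type"]].append(len(m["cast"]))

# $group/$avg + $addFields/$toInt (truncating the average).
result = [
    {"_id": t, "cast_average": sum(v) / len(v),
     "cast_average_int": int(sum(v) / len(v))}
    for t, v in sizes.items()
]
print(result)
```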
The following pipeline writes its result to a new collection named country_movie_count:
db.movies.aggregate(
[
{
$match:
{
type:
{
$ne:null
},
countries:
{
$type:"array"
}
}
},
{
$unwind: "$countries"
},
{
$group:
{
_id: [
"$countries",
"$type"
],
count:
{
$count: {}
}
}
},
{
$sort:
{
count:-1
}
},
{
$limit:5
},
{
$project:
{
_id:false,
'Country':
{
$arrayElemAt: ["$_id", 0]
},
'Movie':
{
$arrayElemAt:["$_id", 1]
},
count:true
}
} ,
{
$out: "country_movie_count"
}
]
)
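The key step above is $unwind, which emits one document per array element so they can be grouped. A plain-Python sketch of the unwind-group-sort-limit-project chain (made-up sample documents, and Counter.most_common standing in for $sort + $limit):

```python
from collections import Counter

# Made-up movie documents; countries is an array field.
movies = [
    {"type": "movie", "countries": ["USA", "China"]},
    {"type": "movie", "countries": ["China"]},
    {"type": "series", "countries": ["China"]},
]

# $match + $unwind + $group/$count: one (country, type) pair per
# array element, counted across all documents.
counts = Counter(
    (country, m["type"])
    for m in movies
    if m.get("type") is not None and isinstance(m.get("countries"), list)
    for country in m["countries"]
)

# $sort (count descending) + $limit: 5 + $project.
result = [
    {"Country": country, "Movie": mtype, "count": n}
    for (country, mtype), n in counts.most_common(5)
]
print(result[0])  # {'Country': 'China', 'Movie': 'movie', 'count': 2}
```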
We can join documents across collections in MongoDB by using the $lookup aggregation stage. $lookup performs a left outer join with another collection, which lets us filter data from the merged result. Matching documents from the "joined" collection are returned as an array embedded in the documents of the original collection.
In MongoDB's JOIN operation, the goal is to connect one set of data to another. To join two collections, we use the $lookup operator, whose syntax is defined below:
{
$lookup:
{
from: <collection to join>,
localField: <field from the input documents>,
foreignField: <field from the documents of the "from" collection>,
as: <output array field>
}
}
The $lookup stage accepts a document containing the four fields shown above: from, localField, foreignField, and as.
Example:
db.address.insertMany(
[
{
"name": "Bob",
"blk_no": 22,
"street" : "dewey street",
"city" : "United States of America"
},
{
"name": "Jack",
"blk_no": 25,
"street" : "gordon street",
"city" : "New Zealand"
}
]
);
{
"acknowledged" : true,
"insertedIds" : [
ObjectId("613594cdb59313217373673c"),
ObjectId("613594cdb59313217373673d")
]
}
db.userInfo.insertMany(
[
{
"contact_name": "Bob",
"age": 27,
"sex" : "male",
"citizenship" : "Filipino"
},
{
"contact_name": "Jack",
"age": 22,
"sex" : "male",
"citizenship" : "Filipino"
}
]
);
{
"acknowledged" : true,
"insertedIds" : [
ObjectId("613594dbb59313217373673e"),
ObjectId("613594dbb59313217373673f")
]
}
The join operation is:
db.userInfo.aggregate([
{ $lookup:
{
from: "address",
localField: "contact_name",
foreignField: "name",
as: "address"
}
}
]).pretty();
The output is:
Output:
{
"_id" : ObjectId("613594dbb59313217373673e"),
"contact_name" : "Bob",
"age" : 27,
"sex" : "male",
"citizenship" : "Filipino",
"address" : [
{
"_id" : ObjectId("613594cdb59313217373673c"),
"name" : "Bob",
"blk_no" : 22,
"street" : "dewey street",
"city" : "United States of America"
}
]
}
{
"_id" : ObjectId("613594dbb59313217373673f"),
"contact_name" : "Jack",
"age" : 22,
"sex" : "male",
"citizenship" : "Filipino",
"address" : [
{
"_id" : ObjectId("613594cdb59313217373673d"),
"name" : "Jack",
"blk_no" : 25,
"street" : "gordon street",
"city" : "New Zealand"
}
]
}
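The left-outer-join behavior shown above can be sketched in plain Python (an illustration of the semantics only; the helper function and its sample documents are made up):

```python
def lookup(left_docs, right_docs, local_field, foreign_field, as_field):
    """Sketch of $lookup: a left outer join that embeds the matching
    documents from the joined collection as an array field."""
    joined = []
    for left in left_docs:
        matches = [r for r in right_docs
                   if r.get(foreign_field) == left.get(local_field)]
        joined.append({**left, as_field: matches})
    return joined

users = [{"contact_name": "Bob"}, {"contact_name": "Eve"}]
addresses = [{"name": "Bob", "city": "United States of America"}]
result = lookup(users, addresses, "contact_name", "name", "address")
# Eve has no match, so her address array is empty -- she is still
# included in the output, which is what makes this a *left* outer join.
print(result)
```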
To learn more about the aggregation framework, I recommend the online book https://www.practical-mongodb-aggregations.com.
db.movies.aggregate(
[
{
$lookup:
{
from: "comments",
foreignField: "movie_id",
localField: "_id",
as : "movie_comments"
}
},
{
$out: "movie_comments"
}
]
)
db.movie_comments.aggregate(
[
{
$group:
{
_id:"$movie_comments.email",
movie_count:
{
$count:{}
}
}
},
{
$match:
{
_id:
{
$not:{$size: 0}
}
}
},
{
$sort:{
movie_count:-1
}
}
]
)
If your documents include text fields, you can speed up text search with the $text operator, but you first need to create a text index on the field. $text performs a text search on the content of the fields indexed with a text index. A $text expression has the following syntax:
{
$text:
{
$search: <string>,
$language: <string>,
$caseSensitive: <boolean>,
$diacriticSensitive: <boolean>
}
}
Example:
The following examples assume a collection articles that has a version 3 text
index on the field subject:
db.articles.createIndex( { subject: "text" } )
Populate the collection with the following documents:
db.articles.insertMany(
[
{ _id: 1, subject: "coffee", author: "xyz", views: 50 },
{ _id: 2, subject: "Coffee Shopping", author: "efg", views: 5 },
{ _id: 3, subject: "Baking a cake", author: "abc", views: 90 },
{ _id: 4, subject: "baking", author: "xyz", views: 100 },
{ _id: 5, subject: "Café Con Leche", author: "abc", views: 200 },
{ _id: 6, subject: "Сырники", author: "jkl", views: 80 },
{ _id: 7, subject: "coffee and cream", author: "efg", views: 10 },
{ _id: 8, subject: "Cafe con Leche", author: "xyz", views: 10 }
]
)
The following query specifies a $search
string of “coffee”:
db.articles.find( { $text: { $search: "coffee" } } )
This query returns the documents that contain the term coffee in the indexed subject field, or more precisely, the stemmed version of the word:
{ "_id" : 2, "subject" : "Coffee Shopping", "author" : "efg", "views" : 5 }
{ "_id" : 7, "subject" : "coffee and cream", "author" : "efg", "views" : 10 }
{ "_id" : 1, "subject" : "coffee", "author" : "xyz", "views" : 50 }
The following example searches for documents that contain the words coffee but do not contain the term shop, or more precisely the stemmed version of the words:
db.articles.find( { $text: { $search: "coffee -shop" } } )
The query returns the following documents:
{ "_id" : 7, "subject" : "coffee and cream", "author" : "efg", "views" : 10 }
{ "_id" : 1, "subject" : "coffee", "author" : "xyz", "views" : 50 }
The following query searches for the term cake and returns the score assigned to each matching document:
db.articles.find(
{ $text: { $search: "cake" } },
{ score: { $meta: "textScore" } }
)
The returned documents include an additional field, score, that contains each document's score for the text search. The $meta operator is an aggregation operator.
The following query searches for documents where the author equals “xyz” and the indexed field subject contains the terms coffee or bake. The operation also specifies a sort order of ascending date, then descending text search score:
db.articles.find(
{
author: "xyz",
$text: { $search: "coffee bake" }
},
{
score: { $meta: "textScore" }
}
).sort(
{
date: 1,
score: { $meta: "textScore" }
}
)