Skip to main content
Version: 1.4.0

Import data

Memgraph comes with tools for importing data into the database. Data can be imported programatically using the LOAD CSV Cypher clause, using the CSV import tool, or using saved cypher queries with mg_client running in non-interactive mode.

info

NOTE: mg_client is a deprecated tool still coming within the Memgraph package. mgconsole will replace mg_client in the future. If possible, please use mgconsole instead.

CSV Import Tool​

CSV is a universal and very versatile data format used to store large quantities of data. Each Memgraph database instance has a CSV import tool installed called mg_import_csv. The CSV import tool should be used for initial bulk ingestion of data into the database. Upon ingestion, the CSV importer creates a snapshot that will be used by the database to recover its state on its next startup.

If you are already familiar with the Neo4j bulk import tool, then using the mg_import_csv tool should be easy. The CSV import tool is fully compatible with the Neo4j CSV format. If you already have a pipeline set-up for Neo4j, you should only replace neo4j-admin import with mg_import_csv.

CSV File Format​

Each row of a CSV file represents a single entry that should be imported into the database. Both nodes and relationships can be imported into the database using CSV files.

Each set of CSV files must have a header that describes the data that is stored in the CSV files. Each field in the CSV header is in the format <name>[:<type>] which identifies the name that should be used for that column and the type that should be used for that column. The type is optional and defaults to string (see the following chapter).

Each CSV field must be divided using the delimiter (--delimiter flag) and each CSV field can either be quoted or unquoted. When the field is quoted, the first and last character in the field must be the quote character (--quote flag). If the field isn't quoted, and a quote character appears in it, it is treated as a regular character. If a quote character appears inside a quoted string then the quote character must be doubled in order to escape it. Line feeds and carriage returns are ignored in the CSV file, also, the file can't contain a NULL character.

The CSV parser uses the same logic as the standard Python CSV parser. The data is parsed in the same way as the following snippet:

import csv
for row in csv.reader(stream, strict=True):
# process 'row'

Python uses 'excel' as the default dialect when parsing CSV files and the default settings for the CSV parser are:

  • delimiter: ','
  • doublequote: True
  • escapechar: None
  • lineterminator: '\r\n'
  • quotechar: '"'
  • skipinitialspace: False

The above snippet can be expanded to:

import csv
for row in csv.reader(stream, delimiter=',', doublequote=True,
escapechar=None, lineterminator='\r\n',
quotechar='"', skipinitialspace=False,
strict=True):
# process 'row'

For more information about the meaning of the above values, see: https://docs.python.org/3/library/csv.html#csv.Dialect

Properties​

Both nodes and relationships can have properties added to them. When importing properties, the CSV importer uses the name specified in the header of the corresponding CSV column for the name of the property. A property is designated by specifying one of the following types in the header:

  • integer, int, long, byte, short: creates an integer property
  • float, double: creates a float property
  • boolean, bool: creates a boolean property
  • string, char: creates a string property

When importing a boolean value, the CSV field should contain exactly the text true to import a True boolean value. All other text values are treated as a boolean value False.

If you want to import an array of values, you can do so by appending [] to any of the above types. The values of the array are then determined by splitting the raw CSV value using the array delimiter (--array-delimiter flag) character.

Assuming that the array delimiter is ;, the following example:

first_name,last_name:string,number:integer,aliases:string[]
John,Doe,1,Johnny;Jo;J-man
Melissa,Doe,2,Mel

Will yield these results:

CREATE ({first_name: "John", last_name: "Doe", number: 1, aliases: ["Johnny", "Jo", "J-man"]});
CREATE ({first_name: "Melissa", last_name: "Doe", number: 2, aliases: ["Mel"]});

Nodes​

When importing nodes, several more types can be specified in the header of the CSV file (along with all property types):

  • ID: id of the node that should be used as the node ID when importing relationships
  • LABEL: designates that the field contains additional labels for the node
  • IGNORE: designates that the field should be ignored

The ID field type sets the internal ID that will be used for the node when creating relationships. It is optional and nodes that don't have an ID value specified will be imported, but can't be connected to any relationships. If you want to save the ID value as a property in the database, just specify a name for the ID (user_id:ID). If you just want to use the ID during the import, leave out the name of the field (:ID). The ID field also supports creating separate ID spaces. The ID space is specified with the ID space name appended to the ID type in parentheses (ID(user)). That allows you to have the same IDs (by value) for multiple different node files (for example, numbers from 1 to N). The IDs in each ID space will be treated as an independent set of IDs that don't interfere with IDs in another ID space.

The LABEL field type adds additional labels to the node. The value is treated as an array type so that multiple additional labels can be specified for each node. The value is split using the array delimiter (--array-delimiter flag).

Relationships​

In order to be able to import relationships, you must import the nodes in the same invocation of mg_import_csv that is used to import the relationships.

When importing relationships, several more types can be specified in the header of the CSV file (along with all property types):

  • START_ID: id of the start node that should be connected with the relationship
  • END_ID: id of the end node that should be connected with the relationship
  • TYPE: designates the type of the relationship
  • IGNORE: designates that the field should be ignored

The START_ID field type sets the start node that should be connected with the relationship to the end node. The field must be specified and the node ID must be one of the node IDs that were specified in the node CSV files. The name of this field is ignored. If the node ID is in an ID space, you can specify the ID space for the in the same way as for the node ID (START_ID(user)).

The END_ID field type sets the end node that should be connected with the relationship to the start node. The field must be specified and the node ID must be one of the node IDs that were specified in the node CSV files. The name of this field is ignored. If the node ID is in an ID space, you can specify the ID space for the in the same way as for the node ID (END_ID(user)).

The TYPE field type sets the type of the relationship. Each relationship must have a relationship type, but it doesn't necessarily need to be specified in the CSV file, it can also be set externally for the whole CSV file. The name of this field is ignored.

CSV Importer Flags​

The importer has many command line options that allow you to customize the way the importer loads your data.

The two main flags that are used to specify the input CSV files are --nodes and --relationships.

The --nodes flag is used to specify CSV files that contain the nodes to the importer. Multiple files can be specified in each supplied --nodes flag. Files that are supplied in one --nodes flag are treated by the CSV parser as one big CSV file. Only the first line of the first file is parsed for the CSV header, all other files (and rows) are treated as data. This is useful when you have a very large CSV file and don't want to edit its first line just to add a CSV header. Instead, you can specify the header in a separate file (e.g. users_header.csv) and have the data intact in the large file (e.g. users.csv). Also, you can supply additional labels for each set of node files. The format of this flag is: [<label>[:<label>]...=]<file>[,<file>][,<file>].... Take note that only the <file> part is mandatory, all other parts of the flag value are optional. Multiple --nodes flags can be supplied to describe multiple sets of different node files. For the importer to work, at least one --nodes flag must be supplied.

The --relationships flag is used to specify CSV files that contain the relationships to the importer. Multiple files can be specified in each supplied --relationships flag. Files that are supplied in one --relationships flag are treated by the CSV parser as one big CSV file. Only the first line of the first file is parsed for the CSV header, all other files (and rows) are treated as data. This is useful when you have a very large CSV file and don't want to edit its first line just to add a CSV header. Instead, you can specify the header in a separate file (e.g. friendships_header.csv) and have the data intact in the large file (e.g. friendships.csv). Also, you can set the type of all relationships in the files for each set of relationships files. The format of this flag is: [<type>=]<file>[,<file>][,<file>].... Take note that only the <file> part is mandatory, all other parts of the flag value are optional. Multiple --relationships flags can be supplied to describe multiple sets of different relationship files. The --relationships flag isn't mandatory.

The --delimiter flag (default ,) sets the delimiter that should be used when splitting the CSV fields.

The --quote flag (default ") sets the quote character that should be used to quote a CSV field.

The --array-delimiter flag (default ;) sets the delimiter that should be used when splitting array values.

The --id-type flag (default STRING) specifies which data type should be used to store the supplied node IDs when storing them as properties (if the field name is supplied). The supported values are either STRING or INTEGER.

The --ignore-empty-strings flag (default false) tells the importer to treat all empty strings as Null values instead of an empty string value.

The --ignore-extra-columns flag (default false) tells the importer to ignore all columns (instead of raising an error) that aren't specified after the last specified column in the CSV header.

The --skip-bad-relationships flag (default false) tells the importer to ignore all relationships (instead of raising an error) that refer to nodes that don't exist in the node files.

The --skip-duplicate-nodes flag (default false) tells the importer to ignore all duplicate nodes (instead of raising an error). Duplicate nodes are nodes that have an ID that is the same as another node that was already imported.

The --trim-strings flag (default false) tells the importer to trim all of the loaded CSV field values before processing them further. Trimming the fields removes all leading and trailing whitespace from them.

How to Use the CSV Import Tool?​

The import tool is run from the console, using the mg_import_csv command. The tool should be run as user memgraph, using the following command:

sudo -u memgraph mg_import_csv

If you installed Memgraph using Docker, you will need to run the importer using the following command:

docker run -v mg_lib:/var/lib/memgraph -v mg_etc:/etc/memgraph -v mg_import:/import-data \
--entrypoint=mg_import_csv memgraph

You can pass CSV files containing node data using the --nodes option. Multiple files can be specified by repeating the --nodes option. At least one node file should be specified. Similarly, graph edges (also known as relationships) are passed via the --relationships option. Multiple relationship files are imported by repeating the option. Unlike nodes, relationships are not required.

Internally, the CSV import tool creates a database snapshot from the CSV files. By default, the tool will search for the installed Memgraph configuration and will store the snapshot inside its configured data directory using the configured properties on edges setting. If the configuration isn't found, you will need to use the --data-directory option to specify the data directory and --storage-properties-on-edges to specify whether properties on edges are enabled. Naturally, you can use the same options to override the default behavior. Memgraph will recover the imported data on the next startup by looking in the data directory. For more details on Memgraph's durability and data recovery features, please check out the appropriate article.

It is also important to note that importing CSV data using the mg_import_csv command should be a one-time operation before running Memgraph. In other words, this tool should not be used to import data into an already running Memgraph instance.

For information on other options, run:

sudo -u memgraph mg_import_csv --help

When using Docker, this translates to:

docker run --entrypoint=mg_import_csv memgraph --help

Example​

Let's import a simple dataset.

Store the following in comment_nodes.csv.

id:ID(COMMENT_ID),country:string,browser:string,content:string,:LABEL
0,Croatia,Chrome,yes,Message;Comment
1,United Kingdom,Chrome,thanks,Message;Comment
2,Germany,,LOL,Message;Comment
3,France,Firefox,I see,Message;Comment
4,Italy,Internet Explorer,fine,Message;Comment

Now, let's add forum_nodes.csv.

id:ID(FORUM_ID),title:string,:LABEL
0,General,Forum
1,Support,Forum
2,Music,Forum
3,Film,Forum
4,Programming,Forum

And finally, set relationships between comments and forums in relationships.csv.

:START_ID(COMMENT_ID),:END_ID(FORUM_ID),:TYPE
0,0,POSTED_ON
1,1,POSTED_ON
2,2,POSTED_ON
3,3,POSTED_ON
4,4,POSTED_ON

Now, you can import the dataset using the CSV importer tool.

WARNING: Your existing snapshot and WAL data will be considered obsolete, and Memgraph will load the new dataset.

Use the following command:

sudo -u memgraph mg_import_csv --nodes comment_nodes.csv --nodes forum_nodes.csv --relationships relationships.csv

If using Docker, things are a bit more complicated. First you need to copy the CSV files where the Docker image can see them:

docker container create --name mg_import_helper -v mg_import:/import-data busybox
docker cp comment_nodes.csv mg_import_helper:/import-data
docker cp forum_nodes.csv mg_import_helper:/import-data
docker cp relationships.csv mg_import_helper:/import-data
docker rm mg_import_helper

Then, run the importer with the following:

docker run -v mg_lib:/var/lib/memgraph -v mg_etc:/etc/memgraph -v mg_import:/import-data \
--entrypoint=mg_import_csv memgraph \
--nodes /import-data/comment_nodes.csv --nodes /import-data/forum_nodes.csv \
--relationships /import-data/relationships.csv

Next time you run Memgraph, the dataset will be loaded.

Importing Cypher queries​

When Memgraph is running, cypher queries are imported by running mg_client in non-interactive mode. The user can import queries saved in e.g. queries.txt by issuing the following shell command:

mg_client --use-ssl=False < queries.txt

If you installed Memgraph using Docker, you will need to run the client using the following command:

docker run -i --entrypoint=mg_client memgraph --host HOST --use-ssl=False < queries.txt

Remember to replace HOST with valid IP of the container (see Note for Docker users).

For more information about mg_client options run:

mg_client --help

Using the LOAD CSV Cypher clause​

The LOAD CSV clause enables you to load and use data from a CSV file of your choosing in a row-based manner, within a query. We support the Excel CSV dialect, as it's the most commonly used one.

The syntax of the clause is:

LOAD CSV FROM <csv-file-path> ( WITH | NO ) HEADER [IGNORE BAD] [DELIMITER <delimiter-string>] [QUOTE <quote-string>] AS <variable-name>
  • <csv-file-path is a string holding the path to the CSV file. There are no restrictions on where in your filesystem the file can be located, as long as the path is valid (i.e. the file exists).

  • ( WITH | NO ) HEADER flag specifies whether the CSV file is to be parsed as though it has or hasn't got a header.

  • IGNORE BAD flag specifies whether rows containing errors should be ignored or not. If it's set, the parser attempts to return the first valid row from the CSV file. If it isn't set, an exception will be thrown on first invalid row encountered.

  • DELIMITER <delimiter-string> option enables you to specify the CSV delimiter character. If it isn't set, the default delimiter character , is assumed.

  • QUOTE <quote-string> option enables you to specify the CSV quote character. If it isn't set, the default quote character " is assumed.

  • <variable-name> is a symbolic name representing the variable to which the contents of the parsed row will be bound to, enabling access to the row contents later in the query.

The clause reads row by row from a CSV file, and binds the contents of the parsed row to the variable you specified.

It's important to note that the parser parses the values as strings. It's up to the user to convert the parsed row values to the appropriate type. This can be done using the built-in conversion functions such as ToInteger, ToFloat, ToBoolean etc. Consult the documentation on the available conversion functions.

Depending on how you set the HEADER option (WITH or NO), a row will be parsed as either a map or a list.

If the WITH HEADER option is set, the very first line in the file will be parsed as the header, and any remaining rows will be parsed as regular rows. The value bound to the row variable will be a map of the form:

{ ( "header_field" : "row_value" )? ( , "header_field" : "row_value" )* }

To access a given field, you can use the property lookup syntax. Let's assume that the CSV file contents are as follows:

x|y|z
1|2|3
4|5|6

The following query will load row by row from the file, and create a new node for each row with properties based on the parsed row values:

LOAD CSV FROM "xyz.csv" WITH HEADER DELIMITER "|" AS row
CREATE (n:A {x: ToInteger(row.x), y: ToInteger(row.y), z: ToInteger(row.z)}) ;

If the NO HEADER option is set, then each row is parsed as a list of values. The contents of the row can be accessed using the list index syntax. Note that in this mode, there are no restrictions on the number of values a row contains. This isn't recommended, as the user must manually handle the varying number of values in a row.

Let's assume that the CSV file contents are as follows:

1|2|3
4|5|6

The following query will load row by row from the file, and create a new node for each row with properties based on the parsed row values:

LOAD CSV FROM "xyz.csv" NO HEADER DELIMITER "|" AS row
CREATE (n:A {x: ToInteger(row[0]), y: ToInteger(row[1]), z: ToInteger(row[2])}) ;

The clause can't stand on its own, meaning there has to be at least one more clause in the query, in addition to it. In other words, the following query will throw an exception:

LOAD CSV FROM "file.csv" WITH HEADER AS row ;

On the other hand, the following query is valid:

MERGE (n:A) LOAD CSV FROM "file.csv" WITH HEADER AS row ;

Note that the row variable doesn't have to be used in any subsequent clause.

Also, it's important to note how the LOAD CSV result exhaustion works. Let's assume the following query:

MATCH (n)
LOAD CSV FROM "file.csv" WITH HEADER as row
SET n.p = row ;

Let's say the MATCH (n) clause finds five nodes, whereas the "file.csv" has only 2 rows. Then only the first two nodes returned by the MATCH (n) will have their properties set, using the two rows from the file. Similarly, if the MATCH (n) clause finds two nodes, whereas the "file.csv" has five rows, the two nodes returned by MATCH (n) will have their properties set with values from the first two rows of the file. In general, the clause that exhausts its results sooner will dictate how many times the "loop" will be executed.

Finally, note that the LOAD CSV clause can be used at most once per query. Trying to use multiple LOAD CSV clauses within a single query will throw an exception. In other words, queries such as

LOAD CSV FROM "x.csv" WITH HEADER as x
LOAD CSV FROM "y.csv" WITH HEADER as y
CREATE (n:A { p1 : x, p2 : y });

are invalid.