[Image: Nodz working with Drupal's Feeds module]

Overfeeding: Taking Drupal's Feeds module to the edge - i18n import

18.03.14

Getting data into Drupal: How Drupal’s Feeds module can be used to import multilingual data.

Importing data with Drupal’s Feeds module

When working on the new website for Kobler & Partner, we decided to switch from Contao to Drupal, which meant that we not only used Drupal to build the site, but also had to import the existing data into it. Some of this data was stored in the old CMS (Contao), some in a proprietary database (FileMaker) with only CSV exports available. The architecture was both multilingual and highly relational, with many entity references between different types of nodes. Implementing it was one thing, but testing or styling it on the frontend was difficult without actual data, which is why importing data was a critical step. The most difficult parts were of course that a) the data was relational and b) the data was multilingual - in fact, it was both at every level.

I am going to explain how I got the import working, taking it one step at a time. In this first part, we’ll look at importing multilingual data. If you already know Drupal’s Feeds module and just want to read about the i18n import trick or if you simply don’t like words because you’re a developer, feel free to skip directly to “Importing multilingual data” and read code!

Drupal import options

Before jumping into the heart of it, let’s talk Drupal modules. The two I considered using for the import were Feeds and Migrate. Both modules have a fairly extensive feature set and include extras: Migrate Extras for Migrate and Feeds Tamper for Feeds. Feeds seemed to be more common (60k installs compared to 10k for Migrate), while Migrate seemed to have better Drush support. One of the big differences is that Feeds is also designed for aggregation and continuous imports, meaning it can be used to import from ongoing sources like a blog or, as the name indicates, a feed.

During the early stages of development, the prospect of being able to run such an ongoing import was highly attractive. Since the import data was incomplete and due to change, continuous additive imports would have allowed us to re-import on a regular basis as the data improved in its source databases.

As development went on, constraints changed and it turned out that the source data export was severely limited. Keeping up with the format changes required a lot of work in the Feeds Tamper UI, and eventually it could no longer handle all the issues that arose; a preprocessing step was needed instead. Preprocessing meant that an ongoing import was no longer an option. I can’t say whether Migrate would have been a better choice; at that point, the import was 90% working with Feeds, so we simply optimised it.

Feeds basics & preprocessing

While Feeds combined with Feeds Tamper is fairly powerful, making changes through the UI can become rather tedious, especially when the import format is unstable or index-based. Things get even more complicated if you have to package your Feeds and Tampers as Features and deploy them to a staging or live site. Thus, if the use case is a one-time migration rather than a recurring import, it makes sense to preprocess as much of the data as possible.

A quick run down of the modules needed to set this up:

Internationalization and its required modules are obviously needed because of the multilingual data. Feeds is responsible for the import, and I used Feeds Tamper for light alterations on the import - at first to avoid preprocessing, later to do lookups. I also used Features to move my Feeds and Tampers between dev and staging sites.

Basic Feeds setup

I had my own exported files, so I didn’t use any of the scheduled or online pull settings. Here are my basic Feeds settings:

  • Attached to: [none]
  • Periodic import: off
  • Import on submission
  • Fetcher: File upload
    • Upload directory: private://feeds
  • Parser: CSV parser
    • No headers: checkbox (take care here, if you go down the no headers path it can be painful!)
  • Processor: Node Processor
    • Bundle: “Desired content type here”
    • Update existing nodes: update existing nodes (this only works if you have a unique identifier and/or use multiple imports)
    • Text format: Plain text (depends on what you’re importing; Markdown works nicely with old HTML)
    • Author: I picked my main editor here
    • Authorize: this doesn’t really matter for a one-time import
    • Expire nodes: NEVER! (unless you want data to disappear now and then)
  • Mappings - see more below.

Before I started preprocessing the data, the CSV parser was set to “No headers”, meaning that I had to refer to the data by column index. With more than 80 columns and the order changing between versions of the file, this turned out to be highly impractical. The preprocessing allowed me to add my own headers and keep them constant. It’s much faster to change a column index or header name in a code file than in Feeds, where mappings cannot be edited: you’d have to delete them and recreate them (along with the associated Tampers), which takes about 3-5 page loads each time. So my advice is to have a stable import file format - even if it requires a preprocessing step.
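To give an idea of what this step can look like, here’s a minimal preprocessing sketch in PHP. The file names, the header names other than sn_v and language, and the column indices are placeholders for illustration, not the actual export format:

// Rewrite the raw export into an import file with our own, stable
// header row; the Feeds mappings refer to these names, so they must
// never change between import runs.
$in  = fopen('export_raw.csv', 'r');
$out = fopen('import_ready.csv', 'w');

fputcsv($out, array('sn_v', 'language', 'title', 'body'));

// Skip the unstable header row of the source file, if there is one.
fgetcsv($in);

while (($row = fgetcsv($in)) !== FALSE) {
  // Pick the columns we need by their current index; when the source
  // format changes, only these indices need updating here.
  fputcsv($out, array($row[0], $row[3], $row[7], $row[12]));
}

fclose($in);
fclose($out);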

Mappings

This is a bit of a table of doom. Note that when I did this, there was no way to edit entries, and the delete/recreate process is slow, so be careful!

Source: text is better, but if the CSV doesn’t use headers it can be a numeric index.

Target: Node Field

Target Configuration: There are two main options here, depending on the target type:

  • For the title you can select “unique” - not so useful when you have similar titles (see the Feeds Tamper section for making other fields unique)
  • For taxonomies you can select the search type - this is very important because it allows:
    • Search by name - finds a taxonomy term with a matching name
    • Auto create - creates the taxonomy term if it isn’t found, great for data imports where you don’t have all the terms yet (sketched below)
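To make the auto-create behaviour concrete, this is roughly the logic in plain Drupal 7 API terms - an illustration of the behaviour, not Feeds’ actual code, and the term and vocabulary names are made up:

// Find a term by name in the given vocabulary, creating it on the
// fly if it doesn't exist yet ("Auto create").
$terms = taxonomy_get_term_by_name('Bern', 'regions');
if ($terms) {
  $term = reset($terms);
}
else {
  // No match: create the missing term in the vocabulary.
  $vocabulary = taxonomy_vocabulary_machine_name_load('regions');
  $term = (object) array(
    'name' => 'Bern',
    'vid' => $vocabulary->vid,
  );
  taxonomy_term_save($term);
}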

This screen can feel a bit inflexible, but we’ll make up for that with Feeds Tamper, so read on!

Feeds Tamper

Feeds Tamper allows you to post-process fields after import but before the nodes are saved. This can make up for a lack of preprocessing, or handle tasks which are impossible during preprocessing because they depend on data that isn’t available in the import file, only in Drupal. This is especially important when dealing with relational data. However, since I’ll cover relational data in the next article, let’s focus on everything else for now.

On the Tamper screen you see a list of all the fields you import, and you can add Tamper plugins to each of them. Let’s look at a few examples of how this can be used. An important thing to keep in mind about Tampers is that they are mapping-dependent: once you delete a mapping, you delete its Tampers, and when you recreate the mapping you have to re-add the Tampers manually.

Setting the node language

As mentioned before, in my initial import each row contained the fields of both languages, so the import was done with two separate Feeds jobs, each importing the fields of one language. To set the node language, I used the following trick:

  1. Set an extra mapping with a header name/index which isn’t used and language as the target (so its value will be blank)
  2. In Tamper, add a “Set default value” plugin
  3. Set the default value to en or de or whichever language you are using (one per Feeds job)

This will try to import a non-existent field, leaving it blank; then, before saving, Tamper will update it to the desired language. After I added preprocessing, each row was one language only and included the language field in the import file, so I no longer needed this. Nevertheless, it can come in handy at times.

New lines

Some databases use characters other than newlines when storing data or exporting it to CSV (newlines have a special meaning in CSV). This was the case for the FileMaker export: many of the newlines were represented by vertical tabulations (also referred to as VT or \v).

This is where the “Find replace REGEX” plugin comes in: you add your pattern and a replacement, \n in this case. I had trouble getting the REGEX in Tamper to match the odd characters correctly, so I ended up replacing them in PHP during preprocessing with a random string (MAGICAL_NEWLINE) - this way no literal \n could interfere with my CSV file (see the sketch after the list below). Then I set up the following plugin:

  1. Find replace REGEX
  2. REGEX to find: /MAGICAL_NEWLINE/
  3. Replacement pattern: \n
  4. Limit number of replacements: <none>
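For the preprocessing side of this, here’s a minimal sketch of the PHP replacement. The helper name is made up, and I’m assuming the odd characters are plain vertical tabs:

// Applied to every cell during preprocessing: swap vertical tabs
// (0x0B) for a placeholder string that survives the CSV round trip.
// The "Find replace REGEX" Tamper above then turns the placeholder
// back into real newlines after the CSV has been parsed.
function replace_vertical_tabs($cell) {
  return str_replace("\x0B", 'MAGICAL_NEWLINE', $cell);
}

$row = array("Line one\x0BLine two", 'unchanged');
$row = array_map('replace_vertical_tabs', $row);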

Lists / multiple values

What if we have a list of items and want to import them into one Drupal field which allows multiple values? We can, but it takes a few steps. I used my preprocessing trick from above again to deal with the vertical tabs; thus MAGICAL_NEWLINE is the placeholder for my separator value, but it could just as well be a comma or a tab.

This time, we actually need to use two plugins in a sequence:

  1. Explode plugin
    • String separator: MAGICAL_NEWLINE
    • Limit: <none>
  2. Filter empty items
    • No options here! (This is nice to make sure you don’t get any validation errors later about empty items)

This will automatically break it up and insert multiple entries into the field. You could do a lot more with Tampers, but these are the main tasks I used it for, even after preprocessing.
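In plain PHP terms, the two plugins together do roughly this to each cell (just an illustration of the behaviour, not how Tamper implements it):

$cell = 'first itemMAGICAL_NEWLINEsecond itemMAGICAL_NEWLINE';

// Explode plugin: split the cell on the separator string.
$items = explode('MAGICAL_NEWLINE', $cell);

// Filter empty items: drop zero-length entries so they don't cause
// validation errors about empty values later.
$items = array_filter($items, 'strlen');

print_r($items);  // [0 => 'first item', 1 => 'second item']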

Import process

With the files uploaded into the private file system, I was able to easily import, delete, modify and repeat until I got what I wanted. However, this only works if you’re using a development site and can afford to delete the data. Having said that, if you manage to set up a unique id on import, you can also run multiple imports. For me this wasn’t possible, because the import data was stored with both languages in the same row, so my EN and DE nodes would have had the same “Import Unique ID” and overwritten each other. It’s an iterative process when you get started, so it really helps to have a development site where you can wipe the data, restart the import and play with the settings until you get what you want.

Importing Multilingual Data

We’ve now got both the data and its language into Drupal. However, none of the nodes are linked to their corresponding translations yet. Preprocessing the data to add a language column and setting it to EN or DE is the easy part; linking the data on import is a bit more tricky.

Drupal’s nodes use the translation id (tnid) for this. Let’s say we created two nodes and linked them. They would look like this:

Nid   Language   Tnid
1     EN         1
2     DE         1

There are two important things to notice here. First, Drupal distinguishes between a source node and its translations: in this case, the DE node with nid 2 is the translation, and its tnid of 1 means the EN node with nid 1 is its source. Second, perhaps surprisingly, nid 1 has tnid 1 as well. This is crucial: without it, the translations won’t show up in the backend, even if the secondary languages point to the source.
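If you want to verify the linkage after an import, Drupal 7 stores both language and tnid directly on the node table, so a quick query shows the whole translation set (a sketch using the tnid from the table above):

// List every node in the translation set with tnid 1.
$result = db_query('SELECT nid, language, tnid FROM {node} WHERE tnid = :tnid', array(':tnid' => 1));
foreach ($result as $record) {
  print $record->nid . ' ' . $record->language . ' ' . $record->tnid . "\n";
}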

I couldn’t find a good way to do this out of the box with Feeds or Feeds Tamper, but I did find a Feeds hook: hook_feeds_after_save. It gets called after the node is saved, which is just when we can step in and update the tnid to meet our needs. The main reason this is hard to do in Tamper is that Tamper runs before save, when the nodes don’t have an id yet - so unless you’re from the future like my co-worker Rune, things are going to get tricky. Jokes aside, the goal of using this hook is that once the node is created, we can update the tnid with either the node’s own nid (if it’s the source language) or look up the node which should be the source and use that node’s nid as the tnid.

Before you start thinking I left something out, let me explain exactly how this lookup from one language to the other happens automatically. The code sample below refers to sn_v or travel_offer_number; this is the record id from the database I imported from. I know that both languages share the same one, since they originally came from the same row. So this process requires that the language mapping already exists in some form in the data you import, even if that means creating extra fields which will never be displayed on the frontend. It could also be generated or formatted via preprocessing to match your needs. Let’s get to the code!

function feeds_my_modulename_feeds_after_save(FeedsSource $source, $entity, $item, $entity_id) {
  // Handle Travel Offer translations; the hook fires for every import.
  if ($entity->type == 'travel_offer') {
    // Load the full node.
    $node = node_load($entity->nid);

    // EN is the source language: point the translation id at the node itself.
    if ($entity->language == 'en') {
      $node->tnid = $entity->nid;
    }
    // DE is a translation, so look up its source.
    elseif ($entity->language == 'de') {
      // Read the sn_v number from the node's fields.
      $fields = field_get_items('node', $node, 'field_travel_offer_number');
      if ($fields) {
        $field = reset($fields);
        $sn_v = $field['value'];
      }
      else {
        // Without sn_v we cannot look up the source node, so bail out.
        return;
      }

      // Look up the id of the English node with the same sn_v.
      $query = new EntityFieldQuery();
      $query->entityCondition('entity_type', 'node');
      $query->propertyCondition('language', 'en');
      $query->propertyCondition('type', 'travel_offer');
      $query->fieldCondition('field_travel_offer_number', 'value', $sn_v, '=');
      $result = $query->execute();

      if ($result) {
        // Take the first match (there should only be one).
        $translation_source_node = reset($result['node']);

        // Set the translation id to the source node's nid.
        $node->tnid = $translation_source_node->nid;
      }
    }

    // Finally, save the node with its updated tnid.
    node_save($node);
  }
}

This hook gets called on every imported node. Here’s a rundown of what it does:

  • Prerequisite: each node, whether DE or EN, has the same sn_v number.
  • Check the entity type - I had multiple imports, not all of which needed multilingual linking. The hook is global; it’s up to you to make sure it only runs on the desired imports.
  • Check the language:
    • If the language is the source language, update tnid to the node’s own nid.
    • If the language is a translation language, look up your unique identifier/translation mapping id, use it to find the node of the other language with a matching identifier, and update tnid to that source node’s nid.
  • Save the node.
And that’s it, now it’s set up the way Drupal expects it to be, and though none the wiser, Drupal can now display translations.

Conclusion

In this first part of the article I went over my basic Feeds setup and how I extended it with a hook to automatically link node translations to their source during the import. As mentioned, this requires that the mapping from source to translations exists prior to the import. In the next article I will build on this, describe the setup I used to import relational data using Entity References, and explain how you can use Feeds to look up the references and set them automatically on import. Combined with the multilingual import, this allows you to import rather complicated data sets into Drupal.

* A sidenote on CSV encoding conversion: FileMaker in some cases exports in deprecated Mac OS encodings, which are hard to convert into another encoding. I managed to do it using LibreOffice Calc by opening the CSV with the Macintosh encoding, saving it to ODS, then saving it back to CSV, this time with a standard Unicode encoding.