This is the answer and explanation to the Solr puzzle on what happens during indexing, using a date as an example. In this blog post, we will dig into the complex and fascinating details of what those three simple commands cause behind the scenes.
So, let’s start by getting the server up and running. This example is based on Solr 5.5, though it should work the same in 6.0.
This starts the server. As we are creating our own core, we are just starting a blank server with its home in server/solr. If you are unsure where the different start commands put Solr’s home and core directories, I have written about that before.
Now, for the first command:
This should create a core. But with what configuration? Or will it crash and burn because we did not provide one, as in answer option 1? Let’s look at the documentation:
So, answer option 1 is incorrect and our new core gets data_driven_schema_configs as its baseline. That configuration is ‘schemaless’: fields that are not already defined in the schema will be auto-defined. You can check the definitions in your own distribution or, for those away from their Solr setup, there is always the source on GitHub.
Now, let’s do the indexing:
To unpack that: we are indexing content in CSV format, and that content - provided inline - consists of only one record with a single field, today, which has the value 2016-04-08.
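To make the shape of that payload concrete, here is a minimal sketch in Python (not Solr code) that reads the same two-line CSV: a header row naming the field, then one row with the value. The variable names are made up for illustration, and the exact payload string is an assumption based on the description above.

```python
import csv
import io

# Assumed inline payload: header line with the field name, then one value line.
payload = "today\n2016-04-08\n"

# DictReader uses the first row as field names and yields one dict per record.
records = list(csv.DictReader(io.StringIO(payload)))
```

Here `records` ends up holding a single record with a single field, which is all that Solr has to work with when it decides what type today should be.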
Should this complain due to a missing uniqueKey (answer option 2) or due to a bad date format (answer option 3)? It would in a static schema. But ours is - as we now know - a managed ‘schemaless’ one. So, it all depends on how the recognition patterns are configured. For our example, that configuration is in the UpdateRequestProcessor chain add-unknown-fields-to-the-schema, starting on line 1316 of the relevant solrconfig.xml.
We can see that the very first step in the URP chain is UUIDUpdateProcessorFactory, which will generate an ID for us if one is not provided. So, answer option 2 is not correct and we will be able to proceed even without an explicit ID.
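The idea behind that first step can be sketched in a few lines of Python. This is only an illustration of the "fill in an ID when it is missing" behaviour, not Solr’s implementation; the function name `ensure_id` and the default field name `id` are assumptions made for the example.

```python
import uuid

def ensure_id(doc, field="id"):
    """Return a copy of doc, adding a random UUID under `field` if it is absent."""
    result = dict(doc)
    if field not in result:
        # Hypothetical stand-in for what the UUID processor does for us.
        result[field] = str(uuid.uuid4())
    return result

doc = ensure_id({"today": "2016-04-08"})
```

A document that already carries an ID passes through unchanged, which is why this step is safe to leave at the front of the chain.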
Onto the date. Date parsing is done by the last of the four explicit parsers, which look for booleans, longs, doubles, and dates, in that order. All of the parsers can take parameters (including complex field selection criteria inherited from the parent class), but only the date parser requires them explicitly. So, we have a long list of Java date formats we can recognize. 2016-04-08 would match yyyy-MM-dd on line 1348. So, answer option 3 is also incorrect and we will have successfully indexed our record, with a multiValued date field created for today. And in fact, doing a *:* query will return us:
(If you did not get this, check the single quotes in your indexing command. Sometimes they get mangled into smart quotes and make commands fail in mysterious ways.)
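The type-guessing order described above (boolean, then long, then double, then date) can be sketched as a simple cascade. This is a rough Python approximation, not the Solr code: the three date patterns are a hypothetical subset translated by hand into strptime syntax (so yyyy-MM-dd becomes %Y-%m-%d), and the real parsers have extra rules, such as field selection, that are left out here.

```python
from datetime import datetime

# Hypothetical subset of the configured date patterns, in strptime syntax.
DATE_FORMATS = ["%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d"]

def parse_boolean(text):
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    raise ValueError(text)

def parse_date(text):
    # Try each pattern in order; the first one that matches wins.
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(text)

# Same order as in the chain: booleans, longs, doubles, dates.
PARSERS = [parse_boolean, int, float, parse_date]

def guess_value(text):
    """Return the first successful parse, or the original string if none match."""
    for parser in PARSERS:
        try:
            return parser(text)
        except ValueError:
            continue
    return text

parsed = guess_value("2016-04-08")
```

With our value, the first three parsers all give up and the date parser picks it up on the last pattern, so `parsed` is a date object rather than a string. The order matters: a value like 42 never reaches the date parser because the long parser claims it first.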
So, now for the curveball: a query by the value Fri, which does not show up anywhere in either our submitted value or the parsed, displayed value.
Do we have no result (answer option 4) or get the record back against all odds (answer option 5)?
PAUSE a bit here if you haven’t figured it out already. We have already discarded 3 out of 5 options. Now, faced with a binary choice, can you figure out the answer and - more importantly - WHY? If you cannot, run the actual commands above and try to figure it out from the information available in the schema, solrconfig.xml, and various Admin UI screens.
And when you are sure, read the rest of the explanation.