Oh Solr. So much power, so much fumbling.
https://solr.apache.org/guide/solr/latest/index.html
Environment
I was able to run Solr on Windows, and use the multiline curl commands from WSL, by adding
SOLR_JETTY_HOST=0.0.0.0
as the last line of:
notepad C:\solr\solr-9.8.1\bin\solr.in.cmd
then stopping and starting Solr.
This also makes Solr available on the LAN.
To verify: curl http://192.168.0.10:8983/solr/admin/info/system from WSL returns a lot of system information.
Start switches
bin/solr start -p 8983 (this is the one you want for dev/testing)
-c Core AND Collection, depending on context:
bin/solr create -c mycore (core name)
bin/solr create_collection -c customers (collection name)
-p Port; 8983 is the typical one in the docs
-e Example (don't use)
Core vs Collection
For now use a Core (standalone mode).
Note that in the docs the create command can also create a collection, depending on the mode Solr is running in:
Command | Solr Mode | Creates
---|---|---
bin/solr create -c mycore | Standalone | Core
bin/solr create -c localDocs --shards 2 -rf 2 | SolrCloud | Collection
bin/solr create_collection -c localDocs --shards 2 -rf 2 | SolrCloud | Collection
Faceting and Clustering
https://solr.apache.org/guide/solr/latest/getting-started/searching-in-solr.html
Faceting is the arrangement of search results into categories (which are based on indexed terms). Within each category, Solr reports on the number of hits for each relevant term, which is called a facet constraint. Faceting makes it easy for users to explore search results on sites such as movie sites and product review sites, where there are many categories and many items within a category.
Clustering groups search results by similarities discovered when a search is executed, rather than when content is indexed. The results of clustering often lack the neat hierarchical organization found in faceted search results, but clustering can be useful nonetheless. It can reveal unexpected commonalities among search results, and it can help users rule out content that isn’t pertinent to what they’re really searching for.
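As a concrete sketch of faceting, a plain select query can ask for counts per category. The core and field names here are assumptions taken from the planning-applications example later in these notes:

```shell
# Build a facet query: rows=0 returns only the counts, not the documents.
# Core name and the app_type field are assumed from the example record below.
URL='http://localhost:8983/solr/planning_applications/select?q=*:*&rows=0&facet=true&facet.field=app_type'
echo "$URL"
# curl "$URL"   # run against a live instance to see the facet counts
```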
Schema
In Solr, the managed-schema.xml (or sometimes just managed-schema) is typically user-edited or managed through the Schema API, but Solr itself can write to it under specific circumstances.
Here's the breakdown:
✅ Solr can write to managed-schema:
If your core is configured to use a managed schema (not schema.xml), then Solr will update managed-schema automatically when you use:
The Schema API to add fields/dynamic fields/copy fields, etc.
Field guessing from documents if schema auto-creation is enabled (update.autoCreateFields=true in solrconfig.xml).
This file is managed by Solr internally in that case — and you should avoid editing it directly because:
Manual changes may be overwritten.
Mistakes can cause Solr startup errors.
✅ User-managed alternative: schema.xml
If you're using a classic schema.xml, then yes, Solr does not write to it, and it's fully user-edited. That mode is often preferred in controlled environments. [gut feeling says this is best to get to]
To switch to classic mode:
In solrconfig.xml, set <schemaFactory class="ClassicIndexSchemaFactory" />.
Rename or provide schema.xml instead of managed-schema.
🔄 How to tell if you're using a managed schema:
In solrconfig.xml, check for:
<schemaFactory class="ManagedIndexSchemaFactory">
This means you're using managed-schema.
Once you want to lock it down, add:
<schemaFactory class="ManagedIndexSchemaFactory">
<bool name="mutable">false</bool>
</schemaFactory>
to solrconfig.xml.
If you're still developing and want both worlds: leave mutable at its default (true) so the Schema API keeps working, and only set it to false once the schema has settled.
Fields and Copyfields
Use copyField when your primary field is tokenized for full-text search but you also need exact-match capability.
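A minimal sketch in managed-schema.xml, assuming a description field (the description_exact name is made up for illustration): keep the tokenized field for full-text search and copy it into a string field for exact match.

```xml
<!-- tokenized field for full-text search -->
<field name="description" type="text_general" indexed="true" stored="true" multiValued="false"/>
<!-- untokenized copy for exact match / faceting; no need to store it twice -->
<field name="description_exact" type="string" indexed="true" stored="false"/>
<copyField source="description" dest="description_exact"/>
```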
Gotchas
Unexpected Arrays
If arrays appear for single-value fields, the field was guessed as multiValued. Change
<field name="description" type="text_general"/>
to
<field name="description" type="text_general" multiValued="false"/>
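Instead of hand-editing managed-schema, the same fix can go through the Schema API. A sketch (core name assumed; note that replace-field expects the full field definition, not a patch):

```shell
# JSON payload for the Schema API's replace-field command.
PAYLOAD='{"replace-field": {"name": "description", "type": "text_general", "multiValued": false}}'
echo "$PAYLOAD"
# curl -X POST 'http://localhost:8983/solr/planning_applications/schema' \
#      -H 'Content-Type: application/json' -d "$PAYLOAD"
```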
*:* vs *.*
Goddammit, Solr fails silently with zero results if I type *.* (DOS-legacy wildcard habit) instead of *:*
Handling the Unique Key in managed-schema.xml
To set the "primary"/update key in that file, look for <uniqueKey> and replace
<uniqueKey>id</uniqueKey> with <uniqueKey>solr_id</uniqueKey>
and make sure the field itself is defined:
<field name="solr_id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
This part of the managed-schema is not auto-updated.
Handling location LatLonPoint
In managed-schema.xml: type="location" is the built-in fieldType for "lat,lon" strings.
In VB.NET:
solrDoc.Add("location", row("lat").ToString() & "," & row("lng").ToString())
Searching in Solr later covers two cases: everything within 10 km of a point, and sorting by nearest.
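A sketch of those two queries using the standard spatial parameters (sfield, pt, and d in kilometres); the core name and point are taken from the test record below, so adjust as needed:

```shell
CORE='http://localhost:8983/solr/planning_applications'
PT='51.2945,-0.7885'   # example point from the test record
# everything within 10 km of the point:
WITHIN="$CORE/select?q=*:*&fq={!geofilt}&sfield=location&pt=$PT&d=10"
# same filter, nearest first:
NEAREST="$WITHIN&sort=geodist()%20asc"
echo "$WITHIN"
echo "$NEAREST"
# curl "$WITHIN"   # run against a live instance
```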
Useful urls
http://localhost:8983/solr/planning_applications/schema
http://localhost:8983/solr/planning_applications/schema/uniquekey (showed solr_id, which is my own unique key)
http://localhost:8983/solr/admin/cores?action=RELOAD&core=planning_applications
Move to fixed schema
In solrconfig.xml set:
<!-- The update.autoCreateFields property can be set to false to disable schemaless mode -->
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:false}">
(leave the processors inside the chain as they are; out of the box this ships as ${update.autoCreateFields:true})
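The same switch can also be flipped without editing solrconfig.xml, via the Config API's set-user-property (core name assumed):

```shell
# user property read by ${update.autoCreateFields:...} in solrconfig.xml
PAYLOAD='{"set-user-property": {"update.autoCreateFields": "false"}}'
echo "$PAYLOAD"
# curl 'http://localhost:8983/solr/planning_applications/config' \
#      -H 'Content-Type: application/json' -d "$PAYLOAD"
```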
Testing
Create record.json with e.g.
[
{
"uid": "#24/00245/FULPP",
"authority": "Rushmoor",
"solr_id": "Rushmoor_2332342F30303234352F46554C5050",
"earliest_date": "18/04/2024Z",
"address": "6 Jubilee Close Farnborough Hampshire GU14 9TD",
"description": "Erection of a single storey side extension",
"lat": 51.2945,
"lng": -0.7885,
"app_size": "Small",
"app_state": "Permitted",
"app_type": "Full",
"linked_from_another_application": false,
"url": "https://publicaccess.rushmoor.gov.uk/online-applications/applicationDetails.do?activeTab=summary&keyVal=SC4OXKNMITN00",
"numAppeals": 0,
"location": "51.2945,-0.7885",
"lc_url": "/planning-applications/local-authority/Rushmoor/uid/2332342F30303234352F46554C5050",
"hex_uid": "2332342F30303234352F46554C5050",
"more_data": true
}
]
then
C:\solr\solr-9.8.1\bin>curl -X POST "http://localhost:8983/solr/planning_applications/update?commit=true" -H "Content-Type: application/json" --data-binary "@record.json"
{
"responseHeader":{
"status":400,
"QTime":128
},
"error":{
"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","java.time.format.DateTimeParseException"],
"msg":"ERROR: [doc=Rushmoor_2332342F30303234352F46554C5050] Error adding field 'earliest_date'='18/04/2024Z' msg=Invalid Date in Date Math String:'18/04/2024Z'",
"code":400
}
}
Running it from the bin directory like this gives the actual error: here earliest_date must be ISO-8601 (e.g. 2024-04-18T00:00:00Z), not DD/MM/YYYY.
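To get the record in, the date has to be ISO-8601. A sketch of patching it with jq (which the reindex script further down already relies on) before re-posting:

```shell
# stand-in for record.json with the failing date format
echo '[{"earliest_date": "18/04/2024Z"}]' > /tmp/record.json
# rewrite the field to ISO-8601, as Solr's date type expects
jq '.[0].earliest_date = "2024-04-18T00:00:00Z"' /tmp/record.json > /tmp/fixed.json
cat /tmp/fixed.json
# curl -X POST "http://localhost:8983/solr/planning_applications/update?commit=true" \
#      -H "Content-Type: application/json" --data-binary "@/tmp/fixed.json"
```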
Backup
For a single core: just stop the instance, then back up/zip
D:\solr\solr-9.8.1\server\solr\core_name\
e.g.
D:\solr\solr-9.8.1\server\solr\planning_applications\
This folder contains \conf \data and the core.properties file.
Searching
Text
Spatial
There are four main field types available for spatial search:
LatLonPointSpatialField (has docValues enabled by default)
PointType
SpatialRecursivePrefixTreeFieldType (RPT for short), including its derivative RptWithGeometrySpatialField
BBoxField
LatLonPointSpatialField is the ideal field type for the most common use-cases for lat-lon point data. RPT offers more features for advanced/custom use cases, such as polygons and heatmaps.
RptWithGeometrySpatialField is for indexing and searching non-point data, though it can do points too. It can't do sorting/boosting.
BBoxField is for indexing bounding boxes, querying by a box, specifying a search predicate (Intersects, Within, Contains, Disjoint, Equals), and a relevancy sort/boost like overlapRatio or simply the area.
Logs
Log Rotation
If you're running Solr as a service (like here), make sure logrotate is handling its logs. Add something like:
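A minimal sketch for /etc/logrotate.d/solr, assuming the default log location /var/solr/logs/solr.log (adjust the path for your install); copytruncate lets Solr keep writing to the same open file:

```
/var/solr/logs/solr.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}
```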
This keeps 7 days of logs, compressed. Change daily to weekly if needed.
Disable tlogs (if not needed)
In solrconfig.xml, within your core's config:
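A sketch: inside the <updateHandler> section, comment out (or remove) the updateLog element; the dir value shown is what stock configs ship with:

```xml
<!-- transaction log disabled; no tlog-based crash recovery
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>
-->
```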
Only do this if you're not using SolrCloud or don’t care about crash recovery.
Auto Start
sudo systemctl is-enabled solr (check if enabled)
sudo systemctl enable solr (enable it)
Migration from Zookeeper and shards to a single node.
- Each shard had 50% of the records.
- Logs: these are managed by log4j and will grow unless rotation is configured, so this matters for all installs (see above).
- I was starting the CLI "example" with the cloud parameter etc.
- I moved the data from /example into /var/solr/data and let the service manage Solr, i.e. sudo systemctl start solr, auto-restart etc.
- Used this reindex.sh to copy data from one shard to a surviving one:
#!/bin/bash
SRC="http://localhost:8983/solr/parallelTextShard2"
DEST="http://localhost:8983/solr/parallelText"
ROWS=500
START=0
while true; do
echo "Fetching $START..."
RESP=$(curl -s "$SRC/select?q=*:*&start=$START&rows=$ROWS&wt=json")
DOCS=$(echo "$RESP" | jq '.response.docs')
COUNT=$(echo "$DOCS" | jq 'length')
if [ "$COUNT" -eq 0 ]; then
echo "Done."
break
fi
echo "$DOCS" | jq 'map(del(._version_))' > docs.json # strip the autogenerated _version_ field, which otherwise makes the destination reject the docs
curl -s "$DEST/update?commit=true" \
-H "Content-Type: application/json" \
--data-binary @docs.json
START=$((START + ROWS))
done